JabRef / jabref

Graphical Java application for managing BibTeX and biblatex (.bib) databases
https://devdocs.jabref.org
MIT License

Error during persistence of crawling results #11704

Closed · superaxander closed 2 weeks ago

superaxander commented 2 weeks ago

JabRef version

5.15 (latest release)

Operating system

Windows

Details on version and operating system

No response

Checked with the latest development build (copy version output from About dialog)

Steps to reproduce the behaviour

  1. Start a systematic literature review
  2. After waiting a while, I get the error "Error during persistence of crawling results", together with a message about an unescaped '}' character, likely coming from one of the papers it finds
  3. The process stops instead of simply skipping that one entry

(I'm also having a separate problem where starting a systematic literature review crashes with a git error complaining that it cannot amend a commit when there are no commits yet. To reproduce the above, what I actually have to do is create the study and let it crash, then create an empty bib file called studyResult.bib, make an initial git commit myself, open the bib file, and click "Update study search results".)

Appendix

Log File

```
<-- start of log removed since it was too long -->
Failed to push
Failed to push
Could not checkout search branch.
Searching...
Invalid cookie for https://www.researchgate.net/search.Search.html?type=publication&query=LLVM%20concurrency:
Invalid cookie for https://www.researchgate.net/search.Search.html?type=publication&query=LLVM%20semantics:
Invalid cookie for https://www.researchgate.net/search.Search.html?type=publication&query=LLVM%20%28verifier%20verification%20checker%20checking%29:
Invalid cookie for https://www.researchgate.net/search.Search.html?type=publication&query=LLVM%20%22separation%20logic%22:
<-- I've cut out a bunch of failed requests here -->
HTTP 404, details: Not Found, Failed to fetch future BibEntry with id '10.1145/3689727' (skipping merge).
org.jabref.logic.importer.FetcherClientException: Encountered HTTP 404 Not Found
    at org.jabref@5.15.60000/org.jabref.logic.net.URLDownload.openConnection(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.performSearchById(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.lambda$asyncPerformSearchById$0(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
HTTP 404, details: Not Found, Failed to fetch future BibEntry with id '10.1145/3689727' (skipping merge).
org.jabref.logic.importer.FetcherClientException: Encountered HTTP 404 Not Found
    at org.jabref@5.15.60000/org.jabref.logic.net.URLDownload.openConnection(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.performSearchById(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.lambda$asyncPerformSearchById$0(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
HTTP 404, details: Not Found, Failed to fetch future BibEntry with id '10.23638/LMCS-17(1:15)2021' (skipping merge).
org.jabref.logic.importer.FetcherClientException: Encountered HTTP 404 Not Found
    at org.jabref@5.15.60000/org.jabref.logic.net.URLDownload.openConnection(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.performSearchById(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.lambda$asyncPerformSearchById$0(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Failed to push
Failed to push
Unescaped '}' character without opening bracket ends string prematurely. Field value: One of the primary areas of interest in High Performance Computing is the improvement of performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) for extracting features from source code. Most such works target specific tasks, or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed. Especially approaches mimicking large-language models (LLMs) have been proposed. But these have prohibitively large training costs. In this paper, we propose MIREncoder, a M}ulti-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs. It allows us to better model code syntax, semantics and structure. For code-based performance optimizations, these features are very important while making optimization decisions. A pre-trained model/embedding implicitly enables the usage of transfer learning, and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have reduced overhead, and be easily usable. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations will show that our proposed approach can outperform the state of the art while reducing overhead.
Invalid field value One of the primary areas of interest in High Performance Computing is the improvement of performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) for extracting features from source code. Most such works target specific tasks, or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed. Especially approaches mimicking large-language models (LLMs) have been proposed. But these have prohibitively large training costs. In this paper, we propose MIREncoder, a M}ulti-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs. It allows us to better model code syntax, semantics and structure. For code-based performance optimizations, these features are very important while making optimization decisions. A pre-trained model/embedding implicitly enables the usage of transfer learning, and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have reduced overhead, and be easily usable. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations will show that our proposed approach can outperform the state of the art while reducing overhead. of field ABSTRACT of entry {]
org.jabref.logic.bibtex.InvalidFieldValueException: Unescaped '}' character without opening bracket ends string prematurely. Field value: One of the primary areas of interest in High Performance Computing is the improvement of performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) for extracting features from source code. Most such works target specific tasks, or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed. Especially approaches mimicking large-language models (LLMs) have been proposed. But these have prohibitively large training costs. In this paper, we propose MIREncoder, a M}ulti-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs. It allows us to better model code syntax, semantics and structure. For code-based performance optimizations, these features are very important while making optimization decisions. A pre-trained model/embedding implicitly enables the usage of transfer learning, and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have reduced overhead, and be easily usable. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations will show that our proposed approach can outperform the state of the art while reducing overhead.
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.FieldWriter.checkBraces(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.FieldWriter.formatWithoutResolvingStrings(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.FieldWriter.write(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.BibEntryWriter.writeField(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.BibEntryWriter.writeRequiredFieldsFirstRemainingFieldsSecond(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.BibEntryWriter.write(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.exporter.BibtexDatabaseWriter.writeEntry(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.exporter.BibDatabaseWriter.savePartOfDatabase(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.exporter.BibDatabaseWriter.saveDatabase(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.crawler.StudyRepository.writeResultToFile(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.crawler.StudyRepository.persistResults(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.crawler.StudyRepository.persist(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.crawler.Crawler.performCrawl(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.gui.slr.ExistingStudySearchAction.lambda$crawl$0(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.gui.util.BackgroundTask$1.call(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.gui.util.UiTaskExecutor$1.call(Unknown Source)
    at javafx.graphics@22.0.1/javafx.concurrent.Task$TaskCallable.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Error during persistence of crawling results.
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
```
koppor commented 2 weeks ago

@superaxander To reproduce it, can you share an SLR search string and the providers you selected? You can also write that to me in a private email. Bogus BibTeX is hard to find nowadays.

superaxander commented 2 weeks ago

I'm pretty sure it's tripping up on this ArXiv entry: https://arxiv.org/abs/2407.02238. The abstract indeed contains a mismatched } in "M}ulti-modal", and if we fetch it via the ArXiv API, it comes back unescaped in the XML data:

```
❯ curl "http://export.arxiv.org/api/query?search_query=all:MIREncoder&start=0&max_results=10"
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <link href="http://arxiv.org/api/query?search_query%3Dall%3AMIREncoder%26id_list%3D%26start%3D0%26max_results%3D10" rel="self" type="application/atom+xml"/>
  <title type="html">ArXiv Query: search_query=all:MIREncoder&amp;id_list=&amp;start=0&amp;max_results=10</title>
  <id>http://arxiv.org/api/KINaKijc1q5cyb8/xzNuy2dHv6Y</id>
  <updated>2024-09-07T00:00:00-04:00</updated>
  <opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:totalResults>
  <opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
  <opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>
  <entry>
    <id>http://arxiv.org/abs/2407.02238v1</id>
    <updated>2024-07-02T13:00:19Z</updated>
    <published>2024-07-02T13:00:19Z</published>
    <title>MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance
  Optimizations</title>
    <summary>  One of the primary areas of interest in High Performance Computing is the
improvement of performance of parallel workloads. Nowadays, compilable source
code-based optimization tasks that employ deep learning often exploit LLVM
Intermediate Representations (IRs) for extracting features from source code.
Most such works target specific tasks, or are designed with a pre-defined set
of heuristics. So far, pre-trained models are rare in this domain, but the
possibilities have been widely discussed. Especially approaches mimicking
large-language models (LLMs) have been proposed. But these have prohibitively
large training costs. In this paper, we propose MIREncoder, a M}ulti-modal
IR-based Auto-Encoder that can be pre-trained to generate a learned embedding
space to be used for downstream tasks by machine learning-based approaches. A
multi-modal approach enables us to better extract features from compilable
programs. It allows us to better model code syntax, semantics and structure.
For code-based performance optimizations, these features are very important
while making optimization decisions. A pre-trained model/embedding implicitly
enables the usage of transfer learning, and helps move away from task-specific
trained models. Additionally, a pre-trained model used for downstream
performance optimization should itself have reduced overhead, and be easily
usable. These considerations have led us to propose a modeling approach that i)
understands code semantics and structure, ii) enables use of transfer learning,
and iii) is small and simple enough to be easily re-purposed or reused even
with low resource availability. Our evaluations will show that our proposed
approach can outperform the state of the art while reducing overhead.
</summary>
    <author>
      <name>Akash Dutta</name>
    </author>
    <author>
      <name>Ali Jannesari</name>
    </author>
    <arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">12 pages, 6 figures, 9 tables, PACT '24 conference</arxiv:comment>
    <link href="http://arxiv.org/abs/2407.02238v1" rel="alternate" type="text/html"/>
    <link title="pdf" href="http://arxiv.org/pdf/2407.02238v1" rel="related" type="application/pdf"/>
    <arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.DC" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.DC" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
    <category term="cs.PF" scheme="http://arxiv.org/schemas/atom"/>
  </entry>
</feed>
```

The code at https://github.com/JabRef/jabref/blob/e801f4117fd62e4f5f42857c7b8d9135a90696fb/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java#L686-L687 would have to check whether the abstract can be put into BibTeX as-is or whether it needs escaping first.
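
For illustration, the failing check is essentially a brace-balance test over the field value. A minimal standalone sketch of such a check, plus a crude cleanup, could look like the following (this is not JabRef's actual FieldWriter code; all class and method names here are made up):

```java
// Standalone sketch of a BibTeX brace sanity check -- illustrative only,
// not JabRef's actual implementation; all names are hypothetical.
public final class BraceSanitizer {

    /** Returns false if a '}' closes a brace that was never opened
     *  (the situation in this issue) or if some '{' stays unclosed. */
    public static boolean hasBalancedBraces(String value) {
        int depth = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                if (--depth < 0) {
                    return false; // stray '}', e.g. in "M}ulti-modal"
                }
            }
        }
        return depth == 0;
    }

    /** Crude fallback: drop '}' characters that have no matching '{'.
     *  Unmatched '{' would still remain; a real fix would handle those too. */
    public static String dropStrayClosingBraces(String value) {
        StringBuilder out = new StringBuilder(value.length());
        int depth = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '{') {
                depth++;
                out.append(c);
            } else if (c == '}') {
                if (depth > 0) {
                    depth--;
                    out.append(c);
                }
                // else: skip the stray '}'
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String abstractText = "...we propose MIREncoder, a M}ulti-modal IR-based Auto-Encoder...";
        System.out.println(hasBalancedBraces(abstractText));      // false
        System.out.println(dropStrayClosingBraces(abstractText)); // stray '}' removed
    }
}
```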

Hope this helps!

Siedlerchr commented 2 weeks ago

Ah, thanks a lot! Yeah, we need to run it through our formatters, e.g. LatexCleanupFormatter
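
If that's the direction, the wiring at the fetch site could look roughly like the sketch below. This assumes JabRef 5.x package paths and that Formatter#format(String) is the right entry point; whether LatexCleanupFormatter alone repairs a stray '}' would still need to be verified.

```java
// Rough sketch (assumed JabRef 5.x APIs): run the fetched abstract through a
// field formatter before storing it on the entry. If LatexCleanupFormatter
// does not balance stray braces, a dedicated brace cleanup like the sketch
// above would have to run as well.
import org.jabref.logic.formatter.bibtexfields.LatexCleanupFormatter;
import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.StandardField;

class AbstractCleanupSketch {
    static void setCleanedAbstract(BibEntry entry, String rawAbstract) {
        String cleaned = new LatexCleanupFormatter().format(rawAbstract);
        entry.setField(StandardField.ABSTRACT, cleaned);
    }
}
```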