Closed: superaxander closed this issue 2 weeks ago
@superaxander To reproduce it, can you share an SLR search string and the providers you selected? You can also send that to me in a private email. Bogus BibTeX is hard to find nowadays.
I'm pretty sure it's tripping up on this ArXiv entry https://arxiv.org/abs/2407.02238
This indeed contains a mismatched } in "M}ulti-modal", and if we fetch it via the ArXiv API, it comes back unescaped in the XML data:
❯ curl "http://export.arxiv.org/api/query?search_query=all:MIREncoder&start=0&max_results=10"
<?xml version="1.0" encoding="UTF-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<link href="http://arxiv.org/api/query?search_query%3Dall%3AMIREncoder%26id_list%3D%26start%3D0%26max_results%3D10" rel="self" type="application/atom+xml"/>
<title type="html">ArXiv Query: search_query=all:MIREncoder&amp;id_list=&amp;start=0&amp;max_results=10</title>
<id>http://arxiv.org/api/KINaKijc1q5cyb8/xzNuy2dHv6Y</id>
<updated>2024-09-07T00:00:00-04:00</updated>
<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">1</opensearch:totalResults>
<opensearch:startIndex xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">0</opensearch:startIndex>
<opensearch:itemsPerPage xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">10</opensearch:itemsPerPage>
<entry>
<id>http://arxiv.org/abs/2407.02238v1</id>
<updated>2024-07-02T13:00:19Z</updated>
<published>2024-07-02T13:00:19Z</published>
<title>MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance
Optimizations</title>
<summary> One of the primary areas of interest in High Performance Computing is the
improvement of performance of parallel workloads. Nowadays, compilable source
code-based optimization tasks that employ deep learning often exploit LLVM
Intermediate Representations (IRs) for extracting features from source code.
Most such works target specific tasks, or are designed with a pre-defined set
of heuristics. So far, pre-trained models are rare in this domain, but the
possibilities have been widely discussed. Especially approaches mimicking
large-language models (LLMs) have been proposed. But these have prohibitively
large training costs. In this paper, we propose MIREncoder, a M}ulti-modal
IR-based Auto-Encoder that can be pre-trained to generate a learned embedding
space to be used for downstream tasks by machine learning-based approaches. A
multi-modal approach enables us to better extract features from compilable
programs. It allows us to better model code syntax, semantics and structure.
For code-based performance optimizations, these features are very important
while making optimization decisions. A pre-trained model/embedding implicitly
enables the usage of transfer learning, and helps move away from task-specific
trained models. Additionally, a pre-trained model used for downstream
performance optimization should itself have reduced overhead, and be easily
usable. These considerations have led us to propose a modeling approach that i)
understands code semantics and structure, ii) enables use of transfer learning,
and iii) is small and simple enough to be easily re-purposed or reused even
with low resource availability. Our evaluations will show that our proposed
approach can outperform the state of the art while reducing overhead.
</summary>
<author>
<name>Akash Dutta</name>
</author>
<author>
<name>Ali Jannesari</name>
</author>
<arxiv:comment xmlns:arxiv="http://arxiv.org/schemas/atom">12 pages, 6 figures, 9 tables, PACT '24 conference</arxiv:comment>
<link href="http://arxiv.org/abs/2407.02238v1" rel="alternate" type="text/html"/>
<link title="pdf" href="http://arxiv.org/pdf/2407.02238v1" rel="related" type="application/pdf"/>
<arxiv:primary_category xmlns:arxiv="http://arxiv.org/schemas/atom" term="cs.DC" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.DC" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.LG" scheme="http://arxiv.org/schemas/atom"/>
<category term="cs.PF" scheme="http://arxiv.org/schemas/atom"/>
</entry>
</feed>
The code at https://github.com/JabRef/jabref/blob/e801f4117fd62e4f5f42857c7b8d9135a90696fb/src/main/java/org/jabref/logic/importer/fetcher/ArXivFetcher.java#L686-L687 would have to check whether the abstract can be written into BibTeX as-is or whether it needs escaping first.
Hope this helps!
Ah, thanks a lot! Yes, we need to run it through our formatter LatexCleanupFormatter.
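For reference, a conservative cleanup could first check whether the braces in a field value balance and, if not, escape them all before writing. This is only a minimal sketch of the idea, not JabRef's actual LatexCleanupFormatter; the class name, method name, and the assumption that the writer accepts backslash-escaped braces are all mine:

```java
// Hypothetical sketch (NOT JabRef's LatexCleanupFormatter): escape curly
// braces in a BibTeX field value when they do not balance, e.g. the stray
// '}' in "M}ulti-modal" from the arXiv abstract above.
public class BraceSanitizer {

    /**
     * Returns the value unchanged if its braces balance; otherwise escapes
     * every brace so BibTeX sees them as literal characters (assuming the
     * writer treats \{ and \} as escaped).
     */
    public static String sanitize(String value) {
        int depth = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    break; // stray closing brace without an opener
                }
            }
        }
        if (depth == 0) {
            return value; // balanced: safe to write as-is
        }
        // Unbalanced: escape all braces rather than guess which one is stray
        return value.replace("{", "\\{").replace("}", "\\}");
    }
}
```

On the abstract above, this would turn "M}ulti-modal" into "M\}ulti-modal", which a brace-balance check would no longer reject (assuming escaped braces are skipped by the count).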
JabRef version
5.15 (latest release)
Operating system
Windows
Details on version and operating system
No response
Checked with the latest development build (copy version output from About dialog)
Steps to reproduce the behaviour
(I'm also hitting a separate problem: starting a systematic literature review crashes with a git error complaining that it cannot amend a commit when there are no commits yet. To reproduce the issue above, I therefore have to create the study and let it crash, create an empty bib file called studyResult.bib, make an initial git commit myself, open the bib file, and click "update study search results".)
Appendix
Log File
```
<-- start of log removed since it was too long -->
Failed to push
Failed to push
Could not checkout search branch.
Searching...
Invalid cookie for https://www.researchgate.net/search.Search.html?type=publication&query=LLVM%20concurrency:
Invalid cookie for https://www.researchgate.net/search.Search.html?type=publication&query=LLVM%20semantics:
Invalid cookie for https://www.researchgate.net/search.Search.html?type=publication&query=LLVM%20%28verifier%20verification%20checker%20checking%29:
Invalid cookie for https://www.researchgate.net/search.Search.html?type=publication&query=LLVM%20%22separation%20logic%22:
<-- I've cut out a bunch of failed requests here -->
HTTP 404, details: Not Found, Failed to fetch future BibEntry with id '10.1145/3689727' (skipping merge).
org.jabref.logic.importer.FetcherClientException: Encountered HTTP 404 Not Found
    at org.jabref@5.15.60000/org.jabref.logic.net.URLDownload.openConnection(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.performSearchById(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.lambda$asyncPerformSearchById$0(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
HTTP 404, details: Not Found, Failed to fetch future BibEntry with id '10.1145/3689727' (skipping merge).
org.jabref.logic.importer.FetcherClientException: Encountered HTTP 404 Not Found
    at org.jabref@5.15.60000/org.jabref.logic.net.URLDownload.openConnection(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.performSearchById(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.lambda$asyncPerformSearchById$0(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
HTTP 404, details: Not Found, Failed to fetch future BibEntry with id '10.23638/LMCS-17(1:15)2021' (skipping merge).
org.jabref.logic.importer.FetcherClientException: Encountered HTTP 404 Not Found
    at org.jabref@5.15.60000/org.jabref.logic.net.URLDownload.openConnection(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.performSearchById(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.importer.fetcher.DoiFetcher.lambda$asyncPerformSearchById$0(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.exec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Failed to push
Failed to push
Unescaped '}' character without opening bracket ends string prematurely. Field value: One of the primary areas of interest in High Performance Computing is the improvement of performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) for extracting features from source code. Most such works target specific tasks, or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed. Especially approaches mimicking large-language models (LLMs) have been proposed. But these have prohibitively large training costs. In this paper, we propose MIREncoder, a M}ulti-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs. It allows us to better model code syntax, semantics and structure. For code-based performance optimizations, these features are very important while making optimization decisions. A pre-trained model/embedding implicitly enables the usage of transfer learning, and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have reduced overhead, and be easily usable. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations will show that our proposed approach can outperform the state of the art while reducing overhead.
Invalid field value One of the primary areas of interest in High Performance Computing is the improvement of performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) for extracting features from source code. Most such works target specific tasks, or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed. Especially approaches mimicking large-language models (LLMs) have been proposed. But these have prohibitively large training costs. In this paper, we propose MIREncoder, a M}ulti-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs. It allows us to better model code syntax, semantics and structure. For code-based performance optimizations, these features are very important while making optimization decisions. A pre-trained model/embedding implicitly enables the usage of transfer learning, and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have reduced overhead, and be easily usable. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations will show that our proposed approach can outperform the state of the art while reducing overhead. of field ABSTRACT of entry {]
org.jabref.logic.bibtex.InvalidFieldValueException: Unescaped '}' character without opening bracket ends string prematurely. Field value: One of the primary areas of interest in High Performance Computing is the improvement of performance of parallel workloads. Nowadays, compilable source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) for extracting features from source code. Most such works target specific tasks, or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed. Especially approaches mimicking large-language models (LLMs) have been proposed. But these have prohibitively large training costs. In this paper, we propose MIREncoder, a M}ulti-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs. It allows us to better model code syntax, semantics and structure. For code-based performance optimizations, these features are very important while making optimization decisions. A pre-trained model/embedding implicitly enables the usage of transfer learning, and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have reduced overhead, and be easily usable. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations will show that our proposed approach can outperform the state of the art while reducing overhead.
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.FieldWriter.checkBraces(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.FieldWriter.formatWithoutResolvingStrings(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.FieldWriter.write(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.BibEntryWriter.writeField(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.BibEntryWriter.writeRequiredFieldsFirstRemainingFieldsSecond(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.bibtex.BibEntryWriter.write(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.exporter.BibtexDatabaseWriter.writeEntry(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.exporter.BibDatabaseWriter.savePartOfDatabase(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.exporter.BibDatabaseWriter.saveDatabase(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.crawler.StudyRepository.writeResultToFile(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.crawler.StudyRepository.persistResults(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.crawler.StudyRepository.persist(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.logic.crawler.Crawler.performCrawl(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.gui.slr.ExistingStudySearchAction.lambda$crawl$0(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.gui.util.BackgroundTask$1.call(Unknown Source)
    at org.jabref@5.15.60000/org.jabref.gui.util.UiTaskExecutor$1.call(Unknown Source)
    at javafx.graphics@22.0.1/javafx.concurrent.Task$TaskCallable.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Error during persistence of crawling results.
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
index exceeds maxCellCount. Check size calculations for class org.jabref.gui.errorconsole.ErrorConsoleView$1
```
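The InvalidFieldValueException in the log is thrown from a brace-balance check (FieldWriter.checkBraces in the trace). The failing condition can be reproduced with a simple depth counter; this is only a sketch of that kind of check under my own assumptions, not JabRef's actual implementation:

```java
// Sketch of a brace-balance check like the one FieldWriter.checkBraces
// performs per the log (the implementation here is an assumption; only the
// error condition is taken from the log message).
public class BraceCheck {

    /** True iff every '}' has a matching earlier '{' and all groups are closed. */
    public static boolean hasBalancedBraces(String value) {
        int depth = 0;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth < 0) {
                    // The "Unescaped '}' character without opening bracket
                    // ends string prematurely" case from the log.
                    return false;
                }
            }
        }
        return depth == 0;
    }
}
```

Run against the fetched abstract, the stray "M}ulti-modal" brace drives the depth negative, which is exactly why writing the entry fails until the value is escaped or cleaned up.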