Closed Quantisan closed 1 year ago
What's the current gold standard in scientific PDF text + data extraction?
A few methods I came across: LayoutLM, Unstructured.io, GROBID (which Elicit uses).
This is valuable metadata because, for example, Elicit only uses the title and abstract portion of a paper in their search algo: https://elicit.org/faq#appendix-how-does-elicit-work
Interesting thread, thanks for posting the breadcrumbs.
One thing that really caught my eye: QuestionsAnsweredExtractor
Responds with "questions answered" by the text? I would love to see the output of this on a test PDF. Hopefully these "suggested questions" would have high accuracy, and they could also give us a sense of what is possible at baseline. If we ran a test to get a list of these questions, I can think of a few potential implications. For example:
- UX research: We could put those questions to the team and ask them which questions make sense.
- UI feature: These questions might be useful as automatically-generated suggested searches.
@Quantisan I will try to run it on a single PDF; let me know if you have already tested this out or if you have other ideas.
Good find! Here's the prompt they're using: https://github.com/jerryjliu/llama_index/blob/main/llama_index/node_parser/extractors/metadata_extractors.py#L299-L311. We can test it ourselves by feeding chunks of context directly to an LLM. I recall Humata has a similar question-suggestion feature.
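To make that test easy to run against any chunk, here's a minimal sketch of the prompt-assembly step. The wording paraphrases the idea behind the linked llama_index prompt rather than copying it, and the function and variable names are my own, not part of any library; the resulting string can be sent to whatever LLM client you have handy.

```python
# Minimal sketch: build a "questions this context can answer" prompt for a
# single chunk of extracted PDF text. Hand this string to any LLM endpoint.
def questions_answered_prompt(context_chunk: str, num_questions: int = 5) -> str:
    """Return a prompt asking an LLM which questions the chunk can answer."""
    return (
        f"Here is some context:\n{context_chunk}\n\n"
        f"Given the contextual information, generate {num_questions} questions "
        "this context can provide specific answers to which are unlikely to be "
        "found elsewhere. Output one question per line."
    )

chunk = "GROBID parses scholarly PDFs into structured TEI XML..."
prompt = questions_answered_prompt(chunk, num_questions=3)
# `prompt` is now ready to send to a chat/completion endpoint of your choice.
```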
Gave GROBID a quick spin with one of Steve's PDFs. Here's the extracted full text + more metadata than I could dream of: https://gist.github.com/Quantisan/7e1453c55f90535f1816f54b87d8f2c6
I can see why it's such a recommended tool in this niche space.
As an aside, the GROBID docs mention LayoutLM and note that Transformer approaches are very costly to run while not yet yielding significantly better results: https://grobid.readthedocs.io/en/latest/Principles/ In any case, if GROBID performs this well throughout, it's more than good enough for now.
So if I understand correctly, llamaindex probably does suitable metadata extraction, but it requires us to do an extraction step with functions in llama_index.node_parser.extractors, such as QuestionsAnsweredExtractor. So if that is right, then I guess this can be renamed to a feature like "Add metadata extraction".
Also this raises a couple of questions for me:
❓ If we are building a metadata object, can we do any simple queries to third-party services (e.g. with the DOI) to get easier access to canonical metadata about the document? Or do we even have the DOI? (Scope creep, but also might be simpler than other strategies like full OCR.)
❓Will users be able to filter/sort by metadata — does llamaindex have any stock UI for browsing conversational search results? Or is this all just in service to better vectorization? It seems like users will want fine-grained control over this like any standard search interface, but maybe this can be handled in natural language just with healthy vectors+metadata.
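On the DOI question: if a DOI does turn up, Crossref's public REST API returns canonical metadata with a single GET. A minimal sketch, assuming Crossref's documented response shape (the helper names and the placeholder DOI are my own):

```python
# Sketch: look up canonical paper metadata by DOI via the Crossref REST API.
import json
import urllib.request

def crossref_url(doi: str) -> str:
    """Build the Crossref works endpoint URL for a DOI."""
    return f"https://api.crossref.org/works/{doi}"

def parse_crossref(message: dict) -> dict:
    """Pick out fields we'd attach to a document's metadata object."""
    return {
        "title": (message.get("title") or [None])[0],
        "doi": message.get("DOI"),
        "year": message.get("issued", {}).get("date-parts", [[None]])[0][0],
    }

# Live usage (network required; "10.1234/example-doi" is a placeholder):
# with urllib.request.urlopen(crossref_url("10.1234/example-doi")) as resp:
#     meta = parse_crossref(json.load(resp)["message"])
```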
@cbfrance I'm working on this ticket today, which will answer (1).
For (2), the metadata are attached to the Nodes, so when we perform the information-retrieval step, we can apply those user-defined metadata filters before feeding the list of nodes to an LLM for querying.
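The filter-then-query idea boils down to something like the following sketch. The node and filter shapes here are simplified dict stand-ins, not llama_index classes:

```python
# Sketch: apply user-defined metadata filters to retrieved nodes before
# passing them to an LLM. Nodes are modeled as plain dicts for illustration.
def filter_nodes(nodes: list[dict], filters: dict) -> list[dict]:
    """Keep only nodes whose metadata matches every key/value in `filters`."""
    return [
        n for n in nodes
        if all(n.get("metadata", {}).get(k) == v for k, v in filters.items())
    ]

nodes = [
    {"text": "...", "metadata": {"year": 2021, "section": "abstract"}},
    {"text": "...", "metadata": {"year": 2019, "section": "methods"}},
]
print(filter_nodes(nodes, {"year": 2021}))  # only the 2021 node survives
```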
https://github.com/jerryjliu/llama_index/blob/3cf8e1ace43787812d2e7e2c8f2ac7dfad0e6972/llama_index/readers/file/docs_reader.py#L35-L44
Scientific papers typically follow a consistent format, with elements such as the title, abstract, and sections. However, these elements are not extracted by the llama-index tool because it is designed to work with generic PDF files.