benjlis / foiarchive-search

Streamlit for FOIArchive search GUI
MIT License

FRUS documents lacking dates #1

Status: Closed (joewiz closed 5 months ago)

joewiz commented 6 months ago

Hi! I just stumbled onto https://foiarchive-search.streamlit.app/. Very nice!

I noticed that when the selected corpus is FRUS, there are 22k documents whose date is "null." This is unexpected, since every document in the FRUS corpus has a machine-readable date (or, more precisely, a range of minimum and maximum datetimes). I'm not sure if this is the right repository for this issue, but I'd be grateful if you could pass it on to whoever is responsible for extracting metadata from FRUS. I'm happy to provide pointers if needed.

[Screenshot 2024-02-24 at 10 00 42]

joewiz commented 6 months ago

P.S. The table below also lists the number of pages in the FRUS corpus as "None," although this page information is captured in the FRUS TEI source files. It also lists the number of documents in the FRUS corpus as 209k, when the actual figure is ~310k.

[Screenshot 2024-02-24 at 10 05 12]

joewiz commented 6 months ago

Regarding the # of documents, I see that https://foiarchive-search.streamlit.app/Overview explains:

"Our processed collection currently contains volumes from 1930-1980, including some volumes from 1861 and 1980-1984."

But still, date and page information is definitely available in the sources, and I would be happy to point you to the locations of this information in the FRUS TEI XML.
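As a rough illustration of where this information might be read from, here is a minimal sketch using Python's standard library. It rests on assumptions that should be checked against the actual FRUS TEI files: that each document is a `div` with `type="document"`, that the date range sits in `frus:doc-dateTime-min` and `frus:doc-dateTime-max` attributes under the namespace URI shown, and that page breaks are TEI `pb` elements. The sample XML and the `extract_documents` helper are made up for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Assumed FRUS namespace URI and attribute names; verify against the source files.
FRUS_NS = "http://history.state.gov/frus/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Tiny hand-written sample, not real FRUS data.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0"
     xmlns:frus="http://history.state.gov/frus/ns/1.0">
  <text><body>
    <div type="document" xml:id="d1"
         frus:doc-dateTime-min="1969-01-20T00:00:00-05:00"
         frus:doc-dateTime-max="1969-01-20T23:59:59-05:00">
      <pb n="1"/><p>First page.</p>
      <pb n="2"/><p>Second page.</p>
    </div>
    <div type="document" xml:id="d2"><p>No date attributes.</p></div>
  </body></text>
</TEI>"""


def extract_documents(tei_xml):
    """Return one dict per document div: id, date range, and page-break count."""
    root = ET.fromstring(tei_xml)
    records = []
    for div in root.iter(f"{{{TEI_NS}}}div"):
        if div.get("type") != "document":
            continue
        records.append({
            "id": div.get(f"{{{XML_NS}}}id"),
            # .get() returns None when the attribute is absent.
            "date_min": div.get(f"{{{FRUS_NS}}}doc-dateTime-min"),
            "date_max": div.get(f"{{{FRUS_NS}}}doc-dateTime-max"),
            # Counts <pb/> elements inside the div; a true page count may
            # differ when a document begins partway down a page.
            "page_breaks": sum(1 for _ in div.iter(f"{{{TEI_NS}}}pb")),
        })
    return records


records = extract_documents(SAMPLE)
for rec in records:
    print(rec)
```

Records whose `date_min` or `date_max` comes back as `None` are the ones that would show up as "null" downstream, so they are the ones worth comparing against the source XML.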

joewiz commented 6 months ago

As long as I've strayed beyond the original topic, I noticed this line in https://doi.org/10.1177/0738894220930326:

"We also introduce an online platform with an application programming interface (API) and website which scholars can use to search for, read through and download datasets customized to their research needs."

I have been unable to locate history-lab.org's API. If the API exists, could you point me to it? Thanks!

benjlis commented 6 months ago

Thank you for raising these data issues and providing detailed supporting information, @joewiz. I'm sharing this with our team and will keep this issue open until we resolve it. For future reference, you can report data issues by clicking the "contact us" link in the app's sidebar, which opens an email to info@history-lab.org. I will also get you the API details.

benjlis commented 5 months ago

We've updated our copy of FRUS with the latest data available at https://github.com/HistoryAtState/frus. It now contains 311,866 documents. We've also incorporated page counts based on the information in the TEI files and significantly reduced the number of records without dates.

The next time we perform a FRUS download, we will ensure that our code extracts all dates. We will also contact you for guidance on best practices. In the meantime, we've noted in our FRUS corpus description that we are missing some dates due to our processing.
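A check along these lines could be run after each download to confirm that no dates were dropped. This is only a sketch: the record layout (`id`, `date_min`, `date_max`) and the `audit_dates` helper are hypothetical and do not describe the project's actual schema or code.

```python
from datetime import datetime


def audit_dates(records):
    """Sort record ids into (ok, missing, invalid) lists based on their date range."""
    ok, missing, invalid = [], [], []
    for rec in records:
        lo, hi = rec.get("date_min"), rec.get("date_max")
        if not lo or not hi:
            missing.append(rec["id"])
            continue
        try:
            # Valid only if both values parse as ISO 8601 and min <= max.
            valid = datetime.fromisoformat(lo) <= datetime.fromisoformat(hi)
        except (ValueError, TypeError):
            # ValueError: unparseable string; TypeError: mixing naive and
            # timezone-aware datetimes in the comparison.
            valid = False
        (ok if valid else invalid).append(rec["id"])
    return ok, missing, invalid


# Made-up records for illustration only.
sample = [
    {"id": "d1", "date_min": "1969-01-20T00:00:00-05:00",
     "date_max": "1969-01-20T23:59:59-05:00"},
    {"id": "d2", "date_min": None, "date_max": None},
    {"id": "d3", "date_min": "1970-05-02T00:00:00+00:00",
     "date_max": "1970-05-01T00:00:00+00:00"},
    {"id": "d4", "date_min": "not-a-date", "date_max": "1970-05-01T00:00:00+00:00"},
]

ok, missing, invalid = audit_dates(sample)
print(f"ok={ok} missing={missing} invalid={invalid}")
```

Keeping `missing` separate from `invalid` makes it easier to tell dates that were never extracted from dates that were extracted but malformed.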

joewiz commented 5 months ago

@benjlis Thank you so much! I'd be happy to answer any questions about FRUS.