Mahdisadjadi / arxivscraper

A Python module to scrape arxiv.org for a date range and category
MIT License

Issue with date_from and date_until #8

Open vgoel38 opened 4 years ago

vgoel38 commented 4 years ago

I copied the following URL from the output of the program. The URL requests records between 2019-01-01 and 2019-05-10.

URL: http://export.arxiv.org/oai2?verb=ListRecords&from=2019-01-01&until=2019-05-10&metadataPrefix=arXiv&set=cs

But many of the records I got lie outside this date range (e.g. the first record is from 2007).

Am I missing something? I am not sure whether the issue is with the code or with the arXiv API.

thecheeseontoast commented 4 years ago

Having the same issue; only 10% of the returned papers were within the requested date range.

Mahdisadjadi commented 4 years ago

@vgoel38 and @thecheeseontoast: Thank you for raising the issue. The scraper returns two date columns for each record: created and updated.

If the updated date is within the specified range, the scraper still returns that record even when the created date is out of range. arXiv specifically mentions this here:

Every OAI-PMH metadata record has a datestamp associated with it, which is the last modification time of that record. Because arXiv has updated metadata records in bulk on several occasions, the OAI-PMH datestamp values do not correspond with the original submission or replacement times for older articles, and may not for newer articles because of administrative and bibliographic updates. The earliest datestamp is given in the earliestDatestamp element of the Identify response.

If it would be useful, I can slightly modify the behavior to use earliestDatestamp in addition to the last datestamp.
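
In the meantime, a workaround is to filter on the created column after scraping. A minimal sketch, assuming the usual DataFrame columns from the README and ISO YYYY-MM-DD date strings:

import arxivscraper
import pandas as pd

scraper = arxivscraper.Scraper(category='cs', date_from='2019-01-01', date_until='2019-05-10')
output = scraper.scrape()

# Keep only records whose original submission (created) date is in range;
# ISO-formatted date strings compare correctly as plain strings.
cols = ('id', 'title', 'categories', 'abstract', 'doi', 'created', 'updated', 'authors')
df = pd.DataFrame(output, columns=cols)
df = df[(df['created'] >= '2019-01-01') & (df['created'] <= '2019-05-10')]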

ChakreshIITGN commented 4 years ago

I noticed that even some dates in the "updated" column are outside the range.

Mahdisadjadi commented 4 years ago

@ChakreshIITGN That's right. The edit doesn't have to be made by the authors. When arXiv runs a bulk job, it modifies the datestamps.

The OAI-PMH interface does not support selective harvesting based on submission date. The datestamps are designed to support incremental harvesting of updates on an ongoing basis. It is not possible to selectively harvest only, say, articles submitted in February 2001 (identifiers 0102.xxxx). Except for selective harvesting based on subject areas (see description of Sets below) the interface is designed to support copying and synchronization of a complete set of arXiv metadata. In order to harvest metadata for all articles, either make requests without a datestamp range (recommended), or make requests from the earliestDatestamp through to the present (but beware that because of bulk updates there are some dates on which there were large numbers of updates). [source]

I am not sure what the best way to proceed is, but I'm considering various options.

valayDave commented 4 years ago

Hey, great tool! I found a bug in the Record._get_authors method: sometimes the author tag doesn't have forenames.

Bug reproduction:

import arxivscraper

# scrape() fails for this range: some author tags lack forenames
scraper = arxivscraper.Scraper(category='cs', date_from='2020-06-25', date_until='2020-06-27')
output = scraper.scrape()
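
A possible guard, as a sketch only (I'm assuming the failure is find() returning None when the forenames child is missing, and that ARXIV is the XML namespace prefix used in the Record class):

def _get_authors(self):
    # Treat a missing keyname/forenames element as an empty string
    authors = self.xml.findall(ARXIV + 'authors/' + ARXIV + 'author')
    names = []
    for author in authors:
        keyname = author.find(ARXIV + 'keyname')
        forenames = author.find(ARXIV + 'forenames')
        last = keyname.text if keyname is not None else ''
        first = forenames.text if forenames is not None else ''
        names.append((first + ' ' + last).strip())
    return names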

Mahdisadjadi commented 4 years ago

@valayDave: Did you install with pip or from the repo?

valayDave commented 4 years ago

I installed with pip, not from source.

Mahdisadjadi commented 4 years ago

@valayDave Sorry, the pip version is lagging, but this issue should be fixed in the source.

Mahdisadjadi commented 4 years ago

@valayDave The pip version is updated to the latest release, so this bug should be fixed.

csrajath commented 4 years ago


@Mahdisadjadi One way to get around this that I thought of: the get_metadata() method has a time key in its dictionary output for every record, and this time is the original submission time. We could check the value of this key against from and until as a filter.
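
A minimal sketch of that idea, assuming each record exposes its original submission date as an ISO YYYY-MM-DD string (the 'created' key name here is an assumption):

def in_range(record, date_from, date_until, key='created'):
    # ISO YYYY-MM-DD strings order correctly under plain string comparison
    return date_from <= record[key][:10] <= date_until

filtered = [r for r in output if in_range(r, '2019-01-01', '2019-05-10')]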