Open vgoel38 opened 4 years ago
Having the same issue: only 10% of the returned papers were within the requested date range.
@vgoel38 and @thecheeseontoast: Thank you for raising the issue. The scraper returns two date columns for each record: `created` and `updated`. If the `updated` date is within the specified range, the scraper still returns that record even when the `created` date is out of the range. ArXiv specifically mentions this here:
Every OAI-PMH metadata record has a datestamp associated with it, which is the last modification time of that record. Because arXiv has updated metadata records in bulk on several occasions, the OAI-PMH datestamp values do not correspond with the original submission or replacement times for older articles, and may not for newer articles because of administrative and bibliographic updates. The earliest datestamp is given in the `earliestDatestamp` element of the Identify response.
If it would be useful, I can slightly modify the behavior to use `earliestDatestamp` in addition to the last datestamp.
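For reference, here is a minimal sketch of reading `earliestDatestamp` out of an Identify response with the standard library. The function name and the sample XML are made up for illustration; only the element name and the OAI-PMH namespace come from the protocol. It is not what the scraper currently does.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 responses.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def parse_earliest_datestamp(identify_xml):
    """Return the text of <earliestDatestamp> from an Identify response.

    Hypothetical helper for illustration, not part of arxivscraper.
    """
    root = ET.fromstring(identify_xml)
    node = root.find(f"{OAI_NS}Identify/{OAI_NS}earliestDatestamp")
    if node is None or node.text is None:
        raise ValueError("no earliestDatestamp element in Identify response")
    return node.text.strip()


# Trimmed, made-up sample; a real response would come from a request
# such as <base URL>?verb=Identify and contains more fields.
SAMPLE_IDENTIFY = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <Identify>
    <repositoryName>Example repository</repositoryName>
    <earliestDatestamp>2007-05-23</earliestDatestamp>
  </Identify>
</OAI-PMH>"""

print(parse_earliest_datestamp(SAMPLE_IDENTIFY))
```

The date in the sample is a placeholder, not arXiv's actual earliest datestamp.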
I notice that even some dates in the "updated" section are out of the range
@ChakreshIITGN That's right. The edit doesn't have to be done by the authors. When ArXiv runs a bulk job, it modifies the datestamps.
The OAI-PMH interface does not support selective harvesting based on submission date. The datestamps are designed to support incremental harvesting of updates on an ongoing basis. It is not possible to selectively harvest only, say, articles submitted in February 2001 (identifiers 0102.xxxx). Except for selective harvesting based on subject areas (see description of Sets below) the interface is designed to support copying and synchronization of a complete set of arXiv metadata. In order to harvest metadata for all articles, either make requests without a datestamp range (recommended), or make requests from the `earliestDatestamp` through to the present (but beware that because of bulk updates there are some dates on which there were large numbers of updates). [source]
I am not sure what is the best way to proceed but I'm considering various options.
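To make the two harvesting modes from the quote concrete, here is a rough sketch of building a `ListRecords` URL with or without a datestamp range. The base URL and query parameters mirror the request URL posted in this thread; the function name is hypothetical, and this is not necessarily how the scraper builds its requests.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the URL reported in this thread.
BASE_URL = "http://export.arxiv.org/oai2"


def list_records_url(set_spec, date_from=None, date_until=None):
    """Build a ListRecords URL; omit from/until to request without a range.

    Hypothetical helper for illustration only.
    """
    params = {"verb": "ListRecords"}
    if date_from:
        params["from"] = date_from
    if date_until:
        params["until"] = date_until
    params["metadataPrefix"] = "arXiv"
    params["set"] = set_spec
    return f"{BASE_URL}?{urlencode(params)}"


# With a datestamp range (selects on last-modified time, not submission time).
print(list_records_url("cs", "2019-01-01", "2019-05-10"))
# Without a range, which the quoted documentation recommends for full harvests.
print(list_records_url("cs"))
```

Either way, `from` and `until` select on the record datestamp, so any filtering on submission date would still have to happen on the client side.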
Hey. Great tool, guys! I found a bug with the `Record._get_authors` method where sometimes the `author` tag doesn't have `forenames`.
Bug reproduction:
import arxivscraper
scraper = arxivscraper.Scraper(category='cs', date_from='2020-06-25', date_until='2020-06-27')
output = scraper.scrape()
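A minimal sketch of how author parsing could tolerate a missing `forenames` element. This is not the library's actual fix: the function name and the sample XML are made up, and the `keyname` element name is my assumption about the arXiv metadata format.

```python
import xml.etree.ElementTree as ET

# Namespace of the arXiv metadata format (metadataPrefix=arXiv).
ARXIV_NS = "{http://arxiv.org/OAI/arXiv/}"


def author_names(authors_elem):
    """Return full author names, skipping name parts that are absent.

    Hypothetical helper; assumes each <author> has an optional <forenames>
    and an optional <keyname> child.
    """
    names = []
    for author in authors_elem.findall(f"{ARXIV_NS}author"):
        forenames = author.findtext(f"{ARXIV_NS}forenames", default="")
        keyname = author.findtext(f"{ARXIV_NS}keyname", default="")
        parts = [p.strip() for p in (forenames, keyname) if p and p.strip()]
        if parts:
            names.append(" ".join(parts))
    return names


# Made-up sample: the second author has no <forenames> element.
SAMPLE_AUTHORS = (
    '<authors xmlns="http://arxiv.org/OAI/arXiv/">'
    "<author><keyname>Doe</keyname><forenames>Jane</forenames></author>"
    "<author><keyname>Example Collaboration</keyname></author>"
    "</authors>"
)

print(author_names(ET.fromstring(SAMPLE_AUTHORS)))
```

Using `findtext` with a default avoids calling `.text` on a `None` result when the element is missing.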
@valayDave: Did you install with `pip` or from the repo?
I installed with `pip`, not from the source.
@valayDave Sorry, the `pip` version is lagging, but this issue should be fixed in the source.
@valayDave The `pip` version is now updated to the latest, so this bug should be fixed.
@Mahdisadjadi One way to get around this that I thought of: the `get_metadata()` method has a `time` key in its dictionary output for every record. This time is the original time of submission. Thus, we can use the value of this key (`time`) as a conditional check against `from` and `until`.
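A rough sketch of that client-side check, assuming each record is a dictionary holding an ISO-formatted submission date. The function name, the record shape, and the sample data are made up for illustration; the key defaults to `created` but could be `time` or whatever field holds the submission date.

```python
from datetime import date


def filter_by_submission_date(records, date_from, date_until, key="created"):
    """Keep only records whose submission date lies in [date_from, date_until].

    Hypothetical helper; records without a value for `key` are dropped.
    """
    start = date.fromisoformat(date_from)
    end = date.fromisoformat(date_until)
    kept = []
    for rec in records:
        value = rec.get(key)
        if not value:
            continue
        # Use only the leading YYYY-MM-DD part in case a full timestamp is given.
        if start <= date.fromisoformat(value[:10]) <= end:
            kept.append(rec)
    return kept


# Made-up sample: the first record was submitted in 2007 but updated in 2019,
# so a datestamp-based harvest for early 2019 would still return it.
SAMPLE_RECORDS = [
    {"id": "example-1", "created": "2007-04-01", "updated": "2019-02-01"},
    {"id": "example-2", "created": "2019-03-15", "updated": "2019-03-20"},
]

in_range = filter_by_submission_date(SAMPLE_RECORDS, "2019-01-01", "2019-05-10")
print([rec["id"] for rec in in_range])
```

This only trims the result after the fact. The request itself still returns every record whose datestamp falls in the range, so the download is no smaller.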
I copied the following URL from the output of the program. The URL requests records between 2019-01-01 and 2019-05-10.
URL: http://export.arxiv.org/oai2?verb=ListRecords&from=2019-01-01&until=2019-05-10&metadataPrefix=arXiv&set=cs
But a lot of the records I got lie outside this date range (e.g. the first record, which is from 2007).
Am I missing something? I am not sure if the issue is with the code or with the arxiv api.