OmnesRes / prepub

Production code for PrePubMed
http://www.prepubmed.org/
MIT License

Various database issues #4

OmnesRes closed 7 years ago

OmnesRes commented 7 years ago

I discovered today that bioRxiv authors with an associated ORCID iD are not getting scraped correctly. I've also known for some time that arXiv q-bio titles and abstracts contain hidden newline characters, which break searches for double-quoted phrases. In addition, affiliation searches in advanced search can return duplicate articles.

I need to change some of the indexing code and rebuild the database. I think a distinct() call on the advanced_search query set may fix the affiliation issue.
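A minimal sketch of why distinct() could help, assuming the duplicates come from a join in which several authors of the same article match the affiliation term. Django's distinct() adds SELECT DISTINCT to the generated SQL, so the same effect can be shown with the standard-library sqlite3 module. The tables, columns, and rows below are hypothetical and are not PrePubMed's actual schema.

```python
import sqlite3

# Hypothetical schema: one article row, many author rows per article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author (
        id INTEGER PRIMARY KEY,
        article_id INTEGER REFERENCES article(id),
        name TEXT,
        affiliation TEXT
    );
    INSERT INTO article VALUES (1, 'Paper A'), (2, 'Paper B');
    INSERT INTO author VALUES
        (1, 1, 'Ann Lee', 'Example University'),
        (2, 1, 'Bo Chen', 'Example University'),
        (3, 2, 'Cy Diaz', 'Other Institute');
""")


def search_by_affiliation(term, distinct):
    """Return (id, title) rows for articles with a matching author affiliation."""
    keyword = "DISTINCT" if distinct else ""
    sql = f"""
        SELECT {keyword} article.id, article.title
        FROM article JOIN author ON author.article_id = article.id
        WHERE author.affiliation LIKE ?
        ORDER BY article.id
    """
    return conn.execute(sql, (f"%{term}%",)).fetchall()


# Two authors of Paper A match, so the plain join yields Paper A twice.
duplicated = search_by_affiliation("Example", distinct=False)
# With DISTINCT, each article appears once.
deduplicated = search_by_affiliation("Example", distinct=True)
```

If the real duplicates have a different cause, for example the same article being indexed twice, distinct() on the queryset would not remove them.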

OmnesRes commented 7 years ago

The author issue was a false alarm. For some reason certain authors were not searchable, but upon inspection they were in the database. Reloading the server solved the problem. I don't think it's worth rebuilding the database for the arXiv q-bio issue at the moment, but I should at least try to stop indexing the newline characters.
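One way to stop indexing the newline characters is to collapse all whitespace in each title and abstract before it is stored. The sketch below is an assumption about how that could be done, not the indexing code itself; normalize_whitespace is a hypothetical helper and the title is made-up example text.

```python
def normalize_whitespace(text):
    """Collapse newlines, tabs, and repeated spaces into single spaces."""
    # str.split() with no arguments splits on any run of whitespace
    # and drops leading/trailing whitespace.
    return " ".join(text.split())


# Made-up title with a hidden newline in the middle of a phrase.
raw_title = "Inferring gene regulatory\n  networks from single-cell data"
clean_title = normalize_whitespace(raw_title)

# A double-quoted phrase search is effectively a substring match.
phrase = "regulatory networks"
found_in_raw = phrase in raw_title
found_in_clean = phrase in clean_title
```

Here the phrase is missed in the raw title because a newline and extra spaces sit between the two words, and it is found once the whitespace is normalized. Records already in the database would still need to be cleaned or re-indexed.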

OmnesRes commented 7 years ago

I did end up finding a small issue with author names. If someone enters a name as first name followed by last name, and both the first name and the last name each appear in the database as a first name and as a last name, the current code fails to identify the name. I think it's an easy fix.
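The existing name-matching code is not shown in this thread, so the following is only a sketch of one possible fix: rather than deciding which token is the first name from per-token lookups, check the full (first, last) pair in both orders. resolve_author and the sample author set are hypothetical.

```python
def resolve_author(query, authors):
    """Return the (first, last) pairs in `authors` that match a two-word query.

    The order as typed is tried first, then the reversed order, so a token
    that exists as both a first name and a last name does not block the lookup.
    """
    tokens = query.split()
    if len(tokens) != 2:
        return []
    a, b = tokens
    candidates = [(a, b)]
    if a != b:
        candidates.append((b, a))
    return [pair for pair in candidates if pair in authors]


# Made-up data: "James", "Martin", and "Scott" each occur as both
# a first name and a last name.
authors = {
    ("James", "Martin"),
    ("Martin", "James"),
    ("Scott", "James"),
    ("Martin", "Scott"),
}

typed_order_first = resolve_author("James Martin", authors)
reversed_only = resolve_author("Scott Martin", authors)
```

In this sketch an ambiguous query returns every matching pair, with the order as typed listed first, and leaves it to the caller to decide whether to show one result or all of them.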