Closed: dhimmel closed this issue 7 years ago.
Thus far I have not had any collaborators on PrePubMed and as a result just used a format that was easy for me to use. Sometimes ASAPbio will use some of the data in PrePubMed and I've asked them if the files were okay and they didn't seem to mind.
I think I will continue to generate the current format for my own (and PrePubMed's) use, but there is no reason I couldn't start creating files which are meant to be used by third parties. If you describe or provide code for an optimal format for you I could start generating that file. I use Python 2.7 by the way, sorry.
> I use Python 2.7 by the way, sorry.
This is the first time I've seen someone apologize for not using python 3 :smile_cat: The Times They Are a-Changin'. Just in case there's any confusion, my original comment applies equally to python 2 & 3.
> I think I will continue to generate the current format for my own (and PrePubMed's) use
If you added header information and a license field to the current file, I wouldn't need anything more.
I think I'm just going to go ahead and have PrePubMed generate a TSV like the one you made. Do you want additional information in that file or exactly like you have it?
> I think I'm just going to go ahead and have PrePubMed generate a TSV like the one you made. Do you want additional information in that file or exactly like you have it?
Well, I think the additional information may come in handy. But given all the compound fields (fields that are actually lists), JSON may be a better format. It would be easy to convert from JSON to a dataframe using `pandas`.
But up to you, of course. A TSV with essential columns would definitely be helpful for many users.
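To illustrate the point about compound fields, here is a quick sketch (with hypothetical field names) showing that a list-valued field survives a JSON round trip without any ad-hoc delimiter convention:

```python
import json

# Hypothetical bioRxiv record with a list-valued (compound) field
record = {'doi': '10.1101/034645', 'subjects': ['Genetics', 'Genomics']}

text = json.dumps(record)    # serialize to a JSON string
restored = json.loads(text)  # parse it back

# The list structure is preserved exactly; no splitting required
assert restored['subjects'] == ['Genetics', 'Genomics']
```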
In general my three recommendations are: `csv`, `json`, or `pandas` (not built in).

I am familiar with JSON and like it since I love dictionaries, but I'm not sure every interested party will know how to deal with JSON.
Is there a reason I can't just have a TSV with list entries separated with commas like I did in the file shared with Twitter?
Also, do you know how to get Python to write unicode to a file? I'll google it.
> Is there a reason I can't just have a TSV with list entries separated with commas like I did in the file shared with Twitter?
This could get complicated when fields contain commas. I usually use `|` to delimit within fields for that reason. The problem is that there isn't really a standard, and thus some manual investigation and post-processing is needed after reading in the file. But I do still think TSV is a valuable format.
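As a sketch of that post-processing step (with hypothetical column names), reading such a TSV and splitting a `|`-delimited field back into a list might look like:

```python
import csv
import io

# A small in-memory TSV with a pipe-delimited compound field
tsv = 'doi\tsubjects\n10.1101/034645\tGenetics|Genomics\n'

reader = csv.DictReader(io.StringIO(tsv), delimiter='\t')
rows = list(reader)

# Manual post-processing: split the compound field on '|'
subjects = rows[0]['subjects'].split('|')
```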
> Also, do you know how to get Python to write unicode to a file? I'll google it.
In 3 it would just work. In 2.7, I think it will also just work... the file will need to be `utf-8` encoded.
> Also, do you know how to get Python to write unicode to a file? I'll google it.
According to http://stackoverflow.com/a/19591815/4651668, use `io.open` with `encoding='utf8'`.
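A minimal sketch of that approach (the filename is arbitrary; `io.open` behaves the same on Python 2.7 and 3):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'unicode_demo.txt')

# io.open with an explicit encoding handles unicode on Python 2.7 and 3
with io.open(path, 'w', encoding='utf8') as f:
    f.write(u'Antoine Lizée\n')

# Read it back with the same encoding
with io.open(path, encoding='utf8') as f:
    content = f.read()
```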
Hmm, I'm looking into it right now. The only entries that I allow to contain unicode characters are Titles and Abstracts. I don't know if you want these in the TSV file or not.
Hmm, I suppose affiliation might also have non-ascii in it. Could be interesting to see if affiliation affects license type.
What about names? There are many non-ascii names in science such as Antoine Lizée. I've found life is much easier when all text is unicode. Not sure how hard that would be in 2.7. If you're interested in a late night, you could look into migrating to python 3.
Also I'm happy to review any changes if you'd like. Just submit a pull request to your repo and mention me.
Yep, yep, making PrePubMed was the first time I discovered the horrors of non-ASCII characters. As described at http://www.prepubmed.org/help/ I store author names with ASCII characters only:
Will this upset the author who is trying to search for their own name how it is supposed to be written? Probably. Will someone else searching for the name type the name out with only ASCII characters? Hopefully.
Does PubMed allow non-ASCII characters in names?
I guess ideally my search would recognize a name regardless of how people search for it, but seems like a lot of effort for such a fringe case and the small number of users I have.
Currently I do this:

```python
import unicodedata
unicodedata.normalize('NFKD', author).encode('ascii', 'ignore')
```
If this is a horrible thing to do I apologize.
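It's a reasonable approach for ASCII folding. A sketch of what that normalization does to an accented name, wrapped in a hypothetical helper:

```python
import unicodedata

def to_ascii(name):
    # NFKD decomposes 'é' into 'e' plus a combining accent mark;
    # encoding to ASCII with errors='ignore' then drops the accent
    return unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')

print(to_ascii(u'Antoine Lizée'))  # → Antoine Lizee
```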
I'm currently working on a giant project so I don't really have time to make major overhauls to PrePubMed, which may become obsolete if ASAPbio creates its Central Service.
> I'm currently working on a giant project so I don't really have time to make major overhauls to PrePubMed, which may become obsolete if ASAPbio creates its Central Service.
Totally understand -- your solution makes sense to me!
Using `io.open` seems legit. I'll make a TSV file with the doi, date (in a Python datetime format), subjects, license, version, title, and affiliations. I'll separate subjects and affiliations with `|`.
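A sketch of writing such a file (hypothetical records and a subset of the columns; the real script would iterate over the indexed preprints):

```python
import csv
import io

# Hypothetical records; list-valued fields get joined with '|'
records = [
    {'doi': '10.1101/034645',
     'subjects': ['Genetics', 'Genomics'],
     'license': 'CC BY 4.0'},
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter='\t', lineterminator='\n')
writer.writerow(['doi', 'subjects', 'license'])  # header row
for rec in records:
    writer.writerow([rec['doi'], '|'.join(rec['subjects']), rec['license']])

tsv = buf.getvalue()
```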
I noticed that some preprints which previously had multiple subjects now only have a single subject if I visit the page, not sure if you want those fixed or not.
The title will allow you to remove duplicated preprints if you desire. The affiliation is a little tricky: I just grab all affiliations listed. When using affiliation to determine its effect on license type, it may be best to only use the affiliation of the corresponding author (this information would have to be indexed).
This is somewhat unrelated, but ASAPbio had me grab all the email addresses on bioRxiv, which was like 20K. Not sure if you want those for anything. You might be surprised by the number of @aol addresses.
I now have a UTF-8 encoded TSV file: `biorxiv_licenses.tsv`. I'm closing the issue.
> I noticed that some preprints which previously had multiple subjects now only have a single subject if I visit the page, not sure if you want those fixed or not.
Whatever PrePubMed decides on we'll go with! I'm fine with outsourcing all of these hard decisions. If all preprints now only have a single subject, this may be worth updating, as it changes `subjects` to `subject` and will affect how we analyze the data.
> I now have a UTF-8 encoded TSV file: `biorxiv_licenses.tsv`
Awesome, I left you a few comments on the code at https://github.com/OmnesRes/prepub/commit/d6dea1cf6076627f87310d0f8601c63480ed981f. Let me know when you're done addressing them, and then I'll make a notebook/script in dhimmel/biorxiv-licenses to read your dataset.
Hmm, it might change all multiple subjects to single subjects; I'm not sure, I'd have to check. I just looked at one bioRxiv paper which had like 7 subjects in PrePubMed, but when I visited the link to bioRxiv it only had one. I don't think I'm going to reindex the articles for PrePubMed (the SQL database would have to get rebuilt, which is kind of a pain but needs to be done at some point for other issues). But if you notice an article with multiple subjects, you can check bioRxiv to see if it still has multiple subjects, or I can go through the TSV file and check all the multiple-subject articles.
Are you familiar with BeautifulSoup? If you ever need additional information that's not indexed you could maybe just make some alterations to my scripts.
> Are you familiar with BeautifulSoup? If you ever need additional information that's not indexed you could maybe just make some alterations to my scripts.
I've used it a few times.
> I just looked at one bioRxiv paper which had like 7 subjects in PrePubMed but when I visited the link to bioRxiv it only had one.
Check out `2.create-figure-data.ipynb` -- I didn't see any preprints with 7 subjects. In fact, very few preprints had multiple subjects, so I'm not too worried about it.
I was looking at the preprint with 5 subjects: Animal Behavior and Cognition|Evolutionary Biology|Genetics|Genomics|Neuroscience
Looks like more because of the spaces.
> I was looking at the preprint with 5 subjects:
What's the ID for this preprint?
Two preprints, by the same author... http://dx.doi.org/10.1101/034645 http://dx.doi.org/10.1101/034413
In https://github.com/dhimmel/biorxiv-licenses/issues/1, we've been discussing the formatting/encoding of `biorxiv/biorxiv.txt`. Currently, the file looks like:

Basically, each line is a python expression representing a bioRxiv article. When using this format, I'd recommend replacing `eval` with `ast.literal_eval` so it's not possible to execute arbitrary code (security vulnerability).

Would it be possible to add a header row to this file? This way the file is self-documenting and you can add new columns without breaking pipelines that analyse this file.
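A sketch of the safer parse (the example line is hypothetical but mirrors the one-record-per-line format):

```python
import ast

# One line of the file: a Python literal expression
line = "{'doi': '10.1101/034645', 'subjects': ['Genetics', 'Genomics']}"

# ast.literal_eval parses literals only and never executes code,
# unlike eval, which would run arbitrary expressions
record = ast.literal_eval(line)
```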
Python expressions versus JSON
We had some discussion about the benefits of the current format: https://github.com/dhimmel/biorxiv-licenses/issues/1#issuecomment-263378746. @OmnesRes mentioned:
JSON gives you unicode and easy reading as text, and is a more standard way of encoding data, so other languages besides Python would be able to read it. Use `indent=2` in `json.dump` for readability. So it's up to you whether you want to switch to JSON. I'd recommend it if you're looking to make the bioRxiv dataset more versatile.
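A sketch of that JSON export (hypothetical records; `ensure_ascii=False` keeps non-ASCII characters readable in the output):

```python
import json

# Hypothetical records mirroring the current per-article dictionaries
records = [
    {'doi': '10.1101/034645',
     'title': u'Example title by Antoine Lizée',
     'subjects': ['Genetics', 'Genomics']},
]

# indent=2 makes the file human-readable;
# ensure_ascii=False writes unicode characters as-is
text = json.dumps(records, indent=2, ensure_ascii=False)
```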