OmnesRes / prepub

Production code for PrePubMed
http://www.prepubmed.org/
MIT License

Suggestions for encoding biorxiv.txt #5

Closed: dhimmel closed this issue 7 years ago

dhimmel commented 7 years ago

In https://github.com/dhimmel/biorxiv-licenses/issues/1, we've been discussing the formatting/encoding of biorxiv/biorxiv.txt.

Currently, the file looks like:

[u'Alternative Growth Behavior of Mycobacterium Avium Subspecies and Staphylococci with Implications for Clinical Microbiology and Blood Culture', ['Peilin Zhang', 'Lawrence M Minardi', 'J. Todd Kuenstner', 'Steve M Zekan'], u'April 16, 2016.', u'Rapid culture of Mycobacterium avium subspecies paratuberculosis (MAP) from patients remains a challenge.  During the process of developing a rapid culture method for MAP, we found that there is an alternative growth behavior present in MAP, MAH (Mycobacterium avium subspecies hominissuis) and other bacteria such as Staphylococcus aureus, and Staphylococcus pseudintermedius.  The bacterial DNA, RNA and proteins are present in the supernatants of the liquid culture media after routine microcentrifugation. When cultured in the solid media plate, there are a limited number of colonies developed for MAP and MAH disproportionate to the growth.  We believe there is an alternative growth behavior for MAP, MAH and other bacteria similar to phenoptosis.  Based on the alternative bacterial growth behavior, we tested 62 blood culture specimens that have been reported negative by routine automated blood culture method after 5 days of incubation. We used alternative culture media and molecular diagnostic techniques to test these negative culture bottles, and we found a large percentage of bacterial growth by alternative culture media (32%) and by molecular PCR amplification using 16s rDNA primer set and DNA sequencing (69%).  The sensitivity of detection by the molecular PCR/sequencing method is significantly higher than by routine automated blood culture.  Given the challenge of early diagnosis of sepsis in the hospital setting, it is necessary to develop more sensitive and faster diagnostic tools to guide clinical practice and improve the outcome of sepsis management.', u'/content/early/2016/04/16/049031', [u'Microbiology'], [u'PZM Diagnostics, LLC']]
[u'Lateral genetic transfers between eukaryotes and bacteriophages', ['Sarah R Bordenstein', 'Seth R Bordenstein'], u'April 16, 2016.', u'Viruses are trifurcated into eukaryotic, archaeal and bacterial categories. This domain-specific ecology underscores why eukaryotic genes are typically co-opted by eukaryotic viruses and bacterial genes are commonly found in bacteriophages. However, the presence of bacteriophages in symbiotic bacteria that obligately reside in eukaryotes may promote eukayotic DNA transfers to bacteriophages. By sequencing full genomes from purified bacteriophage WO particles of Wolbachia, we discover a novel eukaryotic association module with various animal proteins domains, such as the black widow latrotoxin-CTD, that are uninterrupted in intact bacteriophage genomes, enriched with eukaryotic protease cleavage sites, and combined with additional domains to forge some of the largest bacteriophage genes (up to 14,256 bp). These various protein domain families are central to eukaryotic functions and have never before been reported in packaged bacteriophages, and their phylogeny, distribution and sequence diversity implies lateral transfer from animal to bacteriophage genomes. We suggest that the evolution of these eukaryotic protein domains in bacteriophage WO parallels the evolution of eukaryotic genes in canonical eukaryotic viruses, namely those commandeered for viral life cycle adaptations. Analogous selective pressures and evolutionary outcomes may occur in bacteriophage WO as a result of its "two-fold cell challenge" to persist in and traverse cells of obligate intracellular bacteria that strictly reside in animal cells.  Finally, the full WO genome sequences and identification of attachment sites will advance eventual genetic manipulation of Wolbachia for disease control strategies.', u'/content/early/2016/04/16/049049', [u'Genomics'], [u'Vanderbilt University']]

Basically, each line is a Python expression representing a bioRxiv article. When using this format, I'd recommend replacing eval with ast.literal_eval so it isn't possible to execute arbitrary code (a security vulnerability).
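
For instance, a minimal sketch of reading the file this way (the field order follows the sample lines above):

    import ast
    import io

    # Parse each line of biorxiv.txt as a Python literal.
    # Unlike eval, ast.literal_eval only accepts literals (strings,
    # numbers, lists, ...), so a malicious line cannot execute code.
    with io.open('biorxiv/biorxiv.txt', encoding='utf-8') as f:
        for line in f:
            record = ast.literal_eval(line)
            title, authors, date = record[0], record[1], record[2]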

Would it be possible to add a header row to this file? That way the file is self-documenting, and you can add new columns without breaking pipelines that analyse it.

Python expressions versus JSON

We had some discussion about the benefits of the current format: https://github.com/dhimmel/biorxiv-licenses/issues/1#issuecomment-263378746. @OmnesRes mentioned:

If the format was in JSON or pickled I don't know that I would be able to quickly spot errors. The format also has the benefit of storing and writing unicode.

JSON also gives you unicode and easy reading as text, and it's a more standard way of encoding data, so languages besides Python can read it. Use indent=2 in json.dump for readability. So it's up to you whether you want to switch to JSON; I'd recommend it if you're looking to make the bioRxiv dataset more versatile.
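
A sketch of that route, using a toy record in place of the real list (dumps-plus-write avoids a Python 2 quirk where json.dump can mix str and unicode chunks when ensure_ascii=False):

    # -*- coding: utf-8 -*-
    import io
    import json

    # Toy record standing in for the real list of articles.
    articles = [[u'Example preprint', [u'Antoine Lizée'], u'April 16, 2016.']]

    # indent=2 keeps the file human-readable; ensure_ascii=False keeps
    # unicode characters as themselves rather than \u escapes.
    with io.open('biorxiv.json', 'w', encoding='utf-8') as f:
        f.write(json.dumps(articles, indent=2, ensure_ascii=False))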

OmnesRes commented 7 years ago

Thus far I have not had any collaborators on PrePubMed, and as a result I just used a format that was easy for me to use. Sometimes ASAPbio will use some of the data in PrePubMed; I've asked them if the files were okay, and they didn't seem to mind.

I think I will continue to generate the current format for my own (and PrePubMed's) use, but there is no reason I couldn't start creating files which are meant to be used by third parties. If you describe, or provide code for, the format that would be optimal for you, I could start generating that file. I use Python 2.7 by the way, sorry.

dhimmel commented 7 years ago

I use Python 2.7 by the way, sorry.

This is the first time I've seen someone apologize for not using Python 3 :smile_cat: The Times They Are a-Changin'. Just in case there's any confusion, my original comment applies equally to Python 2 and 3.

I think I will continue to generate the current format for my own (and PrePubMed's) use

If you added header information and a license field to the current file, I wouldn't need anything more.

OmnesRes commented 7 years ago

I think I'm just going to go ahead and have PrePubMed generate a TSV like the one you made. Do you want additional information in that file, or should it be exactly as you have it?

dhimmel commented 7 years ago

I think I'm just going to go ahead and have PrePubMed generate a TSV like the one you made. Do you want additional information in that file, or should it be exactly as you have it?

Well, I think the additional information may come in handy. But given all the compound fields (fields that are actually lists), JSON may be a better format. It would be easy to convert from JSON to a dataframe using pandas, as in the sketch below.
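
For example, a toy sketch (pd.read_json would work similarly on a dumped file):

    import pandas as pd

    # Records with compound fields convert directly; 'authors' stays a
    # list inside each cell of the dataframe.
    records = [{'title': 'Example preprint',
                'authors': ['A. Author', 'B. Author']}]
    df = pd.DataFrame(records)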

But up to you, of course. A TSV with essential columns would definitely be helpful for many users.

In general my three recommendations are:

  1. include column/field names
  2. don't implement IO formatting by hand; use a package such as csv, json, or pandas (pandas is not built into Python) -- see the sketch below
  3. prefer standardized formats
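
To illustrate point 2, a minimal sketch in Python 3 with assumed column names and a toy record (the Python 2 csv module does not handle unicode, which matters for this dataset):

    import csv

    # Let the csv module handle delimiters, quoting, and escaping
    # rather than hand-rolling the format.
    articles = [{'doi': '10.1101/049031', 'title': 'Example preprint'}]
    with open('articles.tsv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow(['doi', 'title'])  # header row (recommendation 1)
        for article in articles:
            writer.writerow([article['doi'], article['title']])
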
OmnesRes commented 7 years ago

I am familiar with JSON and like it since I love dictionaries, but I'm not sure every interested party will know how to deal with JSON.

Is there a reason I can't just have a TSV with list entries separated by commas, like in the file I shared on Twitter?

Also, do you know how to get Python to write unicode to a file? I'll google it.

dhimmel commented 7 years ago

Is there a reason I can't just have a TSV with list entries separated by commas, like in the file I shared on Twitter?

This could get complicated when fields contain commas; I usually use | to delimit within fields for that reason. The problem is that there isn't really a standard, so some manual investigation and post-processing is needed after reading the file in. But I do still think TSV is a valuable format.
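
A toy sketch of that post-processing step (the column name subjects is an assumption):

    import pandas as pd

    # Toy frame standing in for the parsed TSV: pipe-delimited compound
    # fields are split back into lists after reading.
    df = pd.DataFrame({'subjects': ['Genomics|Microbiology']})
    df['subjects'] = df['subjects'].str.split('|')
    # df['subjects'][0] == ['Genomics', 'Microbiology']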

Also, do you know how to get Python to write unicode to a file? I'll google it.

In Python 3 it would just work. In 2.7, I think it will also just work... the file will need to be UTF-8 encoded.

dhimmel commented 7 years ago

Also, do you know how to get Python to write unicode to a file? I'll google it.

According to http://stackoverflow.com/a/19591815/4651668, use io.open with encoding='utf-8'.
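
A minimal sketch of that pattern under Python 2.7:

    # -*- coding: utf-8 -*-
    import io

    # io.open takes an encoding in Python 2.7, so unicode strings can
    # be written directly (the built-in open behaves this way in Python 3).
    with io.open('authors.txt', 'w', encoding='utf-8') as f:
        f.write(u'Antoine Lizée\n')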

OmnesRes commented 7 years ago

Hmm, I'm looking into it right now. The only entries that I allow to contain unicode characters are Titles and Abstracts. I don't know if you want these in the TSV file or not.

OmnesRes commented 7 years ago

Hmm, I suppose affiliation might also have non-ASCII in it. Could be interesting to see if affiliation affects license type.

dhimmel commented 7 years ago

What about names? There are many non-ASCII names in science, such as Antoine Lizée. I've found life is much easier when all text is unicode. Not sure how hard that would be in 2.7. If you're interested in a late night, you could look into migrating to Python 3.

Also I'm happy to review any changes if you'd like. Just submit a pull request to your repo and mention me.

OmnesRes commented 7 years ago

Yep, yep, making PrePubMed was the first time I discovered the horrors of non-ASCII characters. As described at http://www.prepubmed.org/help/, I store author names with ASCII characters only: [screenshot of the help page]

Will this upset the author who is trying to search for their own name as it is supposed to be written? Probably. Will someone else searching for the name type it out with only ASCII characters? Hopefully.

Does PubMed allow non-ASCII characters in names?

I guess ideally my search would recognize a name regardless of how people search for it, but that seems like a lot of effort for such a fringe case and the small number of users I have.

Currently I do this:

    import unicodedata
    unicodedata.normalize('NFKD', author).encode('ascii', 'ignore')
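
For what it's worth, that's the standard accent-stripping idiom: NFKD decomposes each accented character into a base letter plus a combining mark, and encoding to ASCII with 'ignore' drops the marks. A quick illustration (Python 2.7):

    >>> import unicodedata
    >>> unicodedata.normalize('NFKD', u'Antoine Lizée').encode('ascii', 'ignore')
    'Antoine Lizee'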

If this is a horrible thing to do I apologize.

I'm currently working on a giant project, so I don't really have time to make major overhauls to PrePubMed, which may become obsolete if ASAPbio creates its Central Service.

dhimmel commented 7 years ago

I'm currently working on a giant project, so I don't really have time to make major overhauls to PrePubMed, which may become obsolete if ASAPbio creates its Central Service.

Totally understand -- your solution makes sense to me!

OmnesRes commented 7 years ago

Using io.open seems legit. I'll make a TSV file with the doi, date (in a Python datetime format), subjects, license, version, title, and affiliations. I'll separate subjects and affiliations with "|".
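
A rough sketch of that plan (Python 2.7 compatible; the toy record, field values, and variable names are assumptions):

    import io

    # One row per preprint, tab-separated, with subjects and
    # affiliations joined on '|'.
    articles = [{
        u'doi': u'10.1101/049031', u'date': u'2016-04-16',
        u'subjects': [u'Microbiology'], u'license': u'CC-BY',
        u'version': u'1', u'title': u'Example preprint',
        u'affiliations': [u'PZM Diagnostics, LLC'],
    }]
    columns = [u'doi', u'date', u'subjects', u'license',
               u'version', u'title', u'affiliations']
    with io.open('biorxiv_licenses.tsv', 'w', encoding='utf-8') as f:
        f.write(u'\t'.join(columns) + u'\n')
        for a in articles:
            row = [u'|'.join(a[c]) if isinstance(a[c], list) else a[c]
                   for c in columns]
            f.write(u'\t'.join(row) + u'\n')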

I noticed that some preprints which previously had multiple subjects now only have a single subject when I visit the page; not sure if you want those fixed or not.

The title will allow you to remove duplicated preprints if you desire. The affiliation is a little tricky: I just grab all affiliations listed. When using affiliation to determine its effect on license type, it may be best to only use the affiliation of the corresponding author (this information would have to be indexed).

This is somewhat unrelated, but ASAPbio had me grab all the email addresses on bioRxiv, which was like 20K. Not sure if you want those for anything. You might be surprised by the number of @aol addresses.

OmnesRes commented 7 years ago

I now have a UTF-8-encoded TSV file: biorxiv_licenses.tsv. I'm closing the issue.

dhimmel commented 7 years ago

I noticed that some preprints which previously had multiple subjects now only have a single subject when I visit the page; not sure if you want those fixed or not.

Whatever PrePubMed decides on, we'll go with! I'm fine with outsourcing all of these hard decisions. If all preprints now only have a single subject, this may be worth updating, as it changes subjects to subject and will affect how we analyze the data.

I now have a UTF-8-encoded TSV file: biorxiv_licenses.tsv

Awesome, I left you a few comments on the code at https://github.com/OmnesRes/prepub/commit/d6dea1cf6076627f87310d0f8601c63480ed981f. Let me know when you're done addressing them, and then I'll make a notebook/script in dhimmel/biorxiv-licenses to read your dataset.

OmnesRes commented 7 years ago

Hmm, it might change all multiple subjects to single subjects; I'm not sure, I'd have to check. I just looked at one bioRxiv paper which had like 7 subjects in PrePubMed, but when I visited the link to bioRxiv it only had one. I don't think I'm going to reindex the articles for PrePubMed (the SQL database would have to get rebuilt, which is kind of a pain but needs to be done at some point for other issues), but if you notice an article with multiple subjects you can check bioRxiv to see if it still has multiple subjects, or I can go through the TSV file and check all the multiple-subject articles.

Are you familiar with BeautifulSoup? If you ever need additional information that's not indexed you could maybe just make some alterations to my scripts.

dhimmel commented 7 years ago

Are you familiar with BeautifulSoup? If you ever need additional information that's not indexed you could maybe just make some alterations to my scripts.

I've used it a few times.

I just looked at one bioRxiv paper which had like 7 subjects in PrePubMed but when I visited the link to bioRxiv it only had one.

Check out 2.create-figure-data.ipynb -- I didn't see any preprints with 7 subjects. In fact, very few preprints had multiple subjects, so I'm not too worried about it.

OmnesRes commented 7 years ago

I was looking at the preprint with 5 subjects: Animal Behavior and Cognition|Evolutionary Biology|Genetics|Genomics|Neuroscience

It looks like more because of the spaces in the subject names.

dhimmel commented 7 years ago

I was looking at the preprint with 5 subjects:

What's the ID for this preprint?

OmnesRes commented 7 years ago

Two preprints, by the same author... http://dx.doi.org/10.1101/034645 http://dx.doi.org/10.1101/034413