TextpressoDevelopers / textpressocentral

Textpressocentral frontend web application

Explain how literatures are classified. Classify literature according to model organisms #26

Open goldturtle opened 5 years ago

goldturtle commented 5 years ago

Transferring from email thread to this ticket.

Michael wrote on 2016-09-15:

There was a request to classify the PMCOA corpus according to more topics than just Biology, Medicine, Genetics, Genomics, etc., i.e., according to model organisms. While ultimately we will use an SVM for that, right now I am scanning subject, title and journal name for keywords to classify the corpus and make what's known in TPC as literatures. So I would appreciate it if you could contribute keywords for those fields (subject, article title and journal name) for the following organisms:

drosophila, C. elegans, arabidopsis, mouse, zebrafish

If you can think of any other model organism that would be of interest to the curation and biomedical community, let me know. Michael.

goldturtle commented 5 years ago

On 2016-09-22 Chris wrote:

Hi Michael,

I'm not exactly sure what you're looking for, but if you're looking for keywords to pull out papers for those species I guess I would suggest the following:

for drosophila: Drosophila, Drosophila melanogaster, D. melanogaster, fruit fly

for C. elegans: C. elegans, Caenorhabditis elegans, Caenorhabditis

for arabidopsis: Arabidopsis, A. thaliana, Arabidopsis thaliana

for mouse: Mus musculus, musculus, murine, mouse, mice

for zebrafish: zebrafish, Danio rerio, D. rerio

As for other model organisms of interest to the curation and biomedical community:

budding yeast: Saccharomyces cerevisiae

fission yeast: Schizosaccharomyces pombe

slime mold: Dictyostelium discoideum

Norway brown rat: Rattus norvegicus

black rat: Rattus rattus

sea squirt: Ciona intestinalis

African clawed frog: Xenopus laevis

Western clawed frog: Xenopus tropicalis

Bacteria: Escherichia coli (E. coli), Bacillus subtilis (B. subtilis)

There's an extensive list on Wikipedia:

https://en.wikipedia.org/wiki/List_of_model_organisms

I hope that helps,

Chris
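The keyword scan Michael describes, fed with the terms listed above, could look roughly like the sketch below. It is only an illustration, not the actual TPC code: the names ORGANISM_KEYWORDS, compile_patterns and classify are made up here, and case-insensitive whole-word matching is an assumption.

```python
import re

# Keyword lists copied from the suggestions in this thread.
ORGANISM_KEYWORDS = {
    "drosophila": ["Drosophila", "Drosophila melanogaster", "D. melanogaster", "fruit fly"],
    "C. elegans": ["C. elegans", "Caenorhabditis elegans", "Caenorhabditis"],
    "arabidopsis": ["Arabidopsis", "A. thaliana", "Arabidopsis thaliana"],
    "mouse": ["Mus musculus", "musculus", "murine", "mouse", "mice"],
    "zebrafish": ["zebrafish", "Danio rerio", "D. rerio"],
}


def compile_patterns(keywords):
    """Build one case-insensitive, whole-word regex per organism."""
    patterns = {}
    for organism, terms in keywords.items():
        # Longest terms first so multi-word names are tried before their parts.
        ordered = sorted(terms, key=len, reverse=True)
        alternatives = "|".join(re.escape(term) for term in ordered)
        # The lookarounds keep "mice" from matching inside "micelle".
        patterns[organism] = re.compile(
            r"(?<!\w)(?:" + alternatives + r")(?!\w)", re.IGNORECASE
        )
    return patterns


PATTERNS = compile_patterns(ORGANISM_KEYWORDS)


def classify(subject, title, journal):
    """Return the sorted list of organisms whose keywords occur in any field."""
    text = " ".join([subject, title, journal])
    return sorted(org for org, pattern in PATTERNS.items() if pattern.search(text))


if __name__ == "__main__":
    print(classify("", "Conserved pathways in C. elegans and zebrafish", ""))
```

Whole-word matching removes some accidental hits, but a paper that merely mentions an organism would still be assigned to it.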

goldturtle commented 5 years ago

Hi Michael,

I wasn't quite sure what you meant by 'subject' in your email, but here are a few other thoughts for classifying literature by organism.

Paper titles and abstracts are probably reasonably good sources of organism names, with all the usual caveats about false positives (e.g. organism is mentioned but the paper does not contain experiments about it) and false negatives (e.g. authors mention mouse but also do an experiment in a human cell line that they don't mention).

For papers that have already been indexed by PubMed, the MeshHeadingList and ChemicalList tags in the XML could also be used.

The list of organisms on the GO annotation downloads page may be helpful, since it indicates organisms for which there was at least sufficient interest to generate GO annotations:

http://geneontology.org/page/download-annotations

Other microbial species that might be of interest are listed in Table 1 of this WormBook chapter:

http://www.wormbook.org/chapters/www_intermicrobpath/intermicrobpath.html
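For the MeSH suggestion above, a minimal sketch of pulling descriptor names out of a PubMed XML record follows. The sample record is hand-written for illustration and is not a real citation; the function name mesh_descriptors is hypothetical. The element names MeshHeadingList, MeshHeading, DescriptorName and QualifierName are the ones PubMed uses in its XML output.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape of a PubMed XML record (illustrative only).
SAMPLE_PUBMED_XML = """
<PubmedArticle>
  <MedlineCitation>
    <MeshHeadingList>
      <MeshHeading>
        <DescriptorName MajorTopicYN="N">Animals</DescriptorName>
      </MeshHeading>
      <MeshHeading>
        <DescriptorName MajorTopicYN="N">Caenorhabditis elegans</DescriptorName>
        <QualifierName MajorTopicYN="Y">genetics</QualifierName>
      </MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>
"""


def mesh_descriptors(xml_text):
    """Return the text of every DescriptorName element, in document order."""
    root = ET.fromstring(xml_text.strip())
    return [d.text for d in root.iter("DescriptorName") if d.text]


if __name__ == "__main__":
    print(mesh_descriptors(SAMPLE_PUBMED_XML))
```

Qualifiers such as "genetics" are ignored here, since the organism name sits in the descriptor.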

goldturtle commented 5 years ago

More from Chris and Michael:

Michael: In addition to the official names, are there also common words that are used to identify species? For example, if there is only the word yeast in the title, would it be safe to assign it to S. cerevisiae (as opposed to fission yeast)?

Chris: For S. cerevisiae the only common words might be "baker's yeast" or "budding yeast". "Yeast" alone would not be sufficient as there are so many types of yeast. For the others that I listed I can only think of (only know of) the names given to the left of the species name, e.g. "slime mold" (of which there are also many types so you would get false positives if looking for Dictyostelium discoideum). For Rattus norvegicus, phrases could be "Norway rat", "brown Norway rat", or simply "brown rat". Wikipedia has some common names for each but I don't know if they would come up in the biomedical literature.

goldturtle commented 5 years ago

We could also solicit input from the various MODs here, as most groups probably have PubMed keyword searches that get them a reasonably good list of papers, or at least could help us avoid any obvious pitfalls.

goldturtle commented 5 years ago

In the nxml files there is a 'subject' line, but I don't know what guidelines it follows. I'll follow up on your links.

M.

In reply to vanaukenk's message of 09/27/2016 10:59 AM.

goldturtle commented 5 years ago

Okay, thanks. I was looking at the XML display for papers in PubMed and searching the page for 'Subject' but couldn't find anything there. For example: https://www.ncbi.nlm.nih.gov/pubmed/27665728?report=xml&format=text Do you have a URL or other location for an nxml that I could look at?

goldturtle commented 5 years ago

I attached an example. Open it in a text editor and search for the subject tag (near the beginning of the file). The keywords are embedded in the tags.

M.

In reply to vanaukenk's message of 09/27/2016 11:17 AM.
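Reading the subject tags out of an nxml file could be sketched as below. This assumes the file follows the usual JATS layout, with subject elements inside subj-group elements under article-categories. The sample document is hand-written and is not the attachment mentioned in this thread; the function name nxml_subjects is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment in the shape of a JATS/nxml front matter (illustrative only).
SAMPLE_NXML = """
<article>
  <front>
    <article-meta>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
        <subj-group subj-group-type="Discipline">
          <subject>Ecology</subject>
          <subject>Evolutionary Biology</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>An example title</article-title>
      </title-group>
    </article-meta>
  </front>
</article>
"""


def nxml_subjects(xml_text):
    """Return the text of every subject element, in document order."""
    root = ET.fromstring(xml_text.strip())
    # iter() walks the whole tree, so nested subj-group elements are covered too.
    return [s.text.strip() for s in root.iter("subject") if s.text]


if __name__ == "__main__":
    print(nxml_subjects(SAMPLE_NXML))
```

The article title can be read the same way by iterating over article-title, which keeps the whole classification inside the nxml file.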

goldturtle commented 5 years ago

Interesting example.

Comparing the nxml subjects:

Ecology
Evolutionary Biology
Ecology/Behavioral Ecology
Ecology/Evolutionary Ecology
Ecology/Population Ecology
Evolutionary Biology/Animal Behavior
Evolutionary Biology/Evolutionary Ecology

with the MeSH headings:

Animals
Grasshoppers/genetics
Phenotype
Phylogeny
Pigmentation/genetics
Polymorphism, Genetic
Population Dynamics
Survival Analysis

there isn't much, if any, overlap, and the actual organism, grasshoppers, is only represented in the MeshHeadingList.

I don't know how representative this example is, but it suggests that a union of the nxml subjects with the MeSH terms might be the best source for mining subjects for literature classification.
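As a rough illustration of the union idea, the sketch below merges the two term lists from this example after lowercasing them. The function name combined_terms is hypothetical. Splitting nxml subjects on "/" and dropping the MeSH qualifiers ("genetics") are assumptions made here for the comparison.

```python
# Term lists taken from the example discussed in this comment.
NXML_SUBJECTS = [
    "Ecology",
    "Evolutionary Biology",
    "Ecology/Behavioral Ecology",
    "Ecology/Evolutionary Ecology",
    "Ecology/Population Ecology",
    "Evolutionary Biology/Animal Behavior",
    "Evolutionary Biology/Evolutionary Ecology",
]

# MeSH descriptors only; qualifiers such as "genetics" are left out.
MESH_DESCRIPTORS = [
    "Animals",
    "Grasshoppers",
    "Phenotype",
    "Phylogeny",
    "Pigmentation",
    "Polymorphism, Genetic",
    "Population Dynamics",
    "Survival Analysis",
]


def normalize_subjects(subjects):
    """Split hierarchical subjects on '/' and lowercase each part."""
    return {part.strip().lower() for s in subjects for part in s.split("/")}


def normalize_mesh(descriptors):
    """Lowercase MeSH descriptors so they compare with the subjects."""
    return {d.strip().lower() for d in descriptors}


def combined_terms(subjects, descriptors):
    """Union of both sources, usable as input to a keyword scan."""
    return normalize_subjects(subjects) | normalize_mesh(descriptors)


if __name__ == "__main__":
    overlap = normalize_subjects(NXML_SUBJECTS) & normalize_mesh(MESH_DESCRIPTORS)
    print("overlap:", sorted(overlap))
    print("grasshoppers" in combined_terms(NXML_SUBJECTS, MESH_DESCRIPTORS))
```

For this one paper the overlap is empty, and the organism only appears once the MeSH side is included.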

goldturtle commented 5 years ago

Well, in this case the animal is mentioned in the title. I don't understand why PMC doesn't have the MeSH terms in their nxmls. The issue is that I would like to do any classification with the nxml file only and not introduce a third source that needs to be synced with the paper.

M.

In reply to vanaukenk's message of 09/27/2016 11:57 AM.

goldturtle commented 5 years ago

Yes, I see the point about not wanting to have to go to a third source. I don't know why PMC doesn't also include MeSH terms in their nxmls. Perhaps an alternative would be to make use of MeSH headings to find other terms or phrases to include for organismal TPC literature classification, although I suspect you have a good list to start from. For all of the genus species names, though, I think you will want to also look at abbreviations like S. cerevisiae, X. laevis, etc.
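The abbreviated forms can be derived from the full names rather than typed by hand. A minimal sketch, assuming every name is a plain two-word binomial; the function names abbreviate and name_variants are hypothetical:

```python
def abbreviate(binomial):
    """Turn 'Genus species' into 'G. species'."""
    parts = binomial.split()
    if len(parts) != 2:
        raise ValueError(f"expected a two-word binomial, got {binomial!r}")
    genus, species = parts
    return f"{genus[0].upper()}. {species}"


def name_variants(binomial):
    """Full name plus its abbreviated form, for use as search keywords."""
    return [binomial, abbreviate(binomial)]


if __name__ == "__main__":
    for name in ["Saccharomyces cerevisiae", "Xenopus laevis"]:
        print(name_variants(name))
```

An abbreviation can be shared by species from different genera with the same initial, so abbreviated hits are weaker evidence than full names.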