bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling


Create dataset loader for CHEBI (Chapati) #113

Open leonweber opened 2 years ago

leonweber commented 2 years ago

Task: NER
License: Creative Commons
Format: custom
Language: English
Citation: ???

Referenced and used by: Habibi, Maryam, et al. "Deep learning with word embeddings improves biomedical named entity recognition." Bioinformatics.

Source: http://chebi.cvs.sourceforge.net/viewvc/chebi/chapati/

napsternxg commented 2 years ago

self-assign

hakunanatasha commented 2 years ago

Hi @napsternxg, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8; no worries if you are not finished but intend to work on this. Please either ping me here at @hakunanatasha or ping the Discord admins (with @admins).

napsternxg commented 2 years ago

Hi @hakunanatasha, yes, I plan to work on this over the weekend.

napsternxg commented 2 years ago

I have started working on this dataset. I will send a PR soon.

napsternxg commented 2 years ago

Hi @hakunanatasha and @leonweber I have a few questions on how to parse the data. Code related to my questions is in: https://colab.research.google.com/drive/1Ne8A76yn0vxwKkpU7l_OzGI968B-YieJ?usp=sharing

The data is in a modified HTML format. I am able to parse it via the BeautifulSoup library, but that library is not part of our requirements file. What would be the best way to proceed? E.g., if I try to load the file via:

filepath = "./scrapbook/WO2007000651/source.xml"
reader = biocxml.BioCXMLDocumentReader(str(filepath))

I get the error:

AttributeError: 'BioCXMLDocumentReader' object has no attribute '_BioCXMLDocumentReader__document'

The data download requires CVS to be installed. How should I address this? Should I include a note about installing it, or is it better to just process the data and upload the processed data to the Hugging Face dataset hub?

jason-fries commented 2 years ago

Hi @napsternxg Sorry about the delay in responding!

Let's remove the CVS dependency. The original gold data is open ("This work is distributed under the Creative Commons license: http://creativecommons.org/licenses/by/3.0/") so I would download the files and put them somewhere open (e.g., google drive link) and then we can eventually host the files on the biomedical community hub (see our BIOSSES example which does this).
The BioCXMLDocumentReader assumes you are using a BioC formatted file, so it won't work (that I know of) with standard or nonstandard XML files. The XML package available by default in Python might work here. If not, go ahead and use BeautifulSoup and we can discuss adding it to our supported packages.
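
For illustration, here is a minimal sketch of the standard-library route using `xml.etree.ElementTree`. It assumes the annotations are inline elements wrapped around the mention text; the tag name `ne`, the `type` attribute, and the helper name `extract_inline_spans` are hypothetical placeholders, not taken from the CHEBI files. `ET.fromstring` raises `ParseError` on markup that is not well-formed XML, which is the case where BeautifulSoup would be the fallback.

```python
import xml.etree.ElementTree as ET


def extract_inline_spans(xml_text, entity_tag="ne"):
    """Return (plain_text, spans) for XML with inline annotation elements.

    Each span is (start, end, attributes, mention), with character offsets
    into plain_text. `entity_tag` is a hypothetical tag name.
    """
    root = ET.fromstring(xml_text)  # raises ET.ParseError if not well-formed
    parts = []
    raw_spans = []
    offset = 0

    def walk(elem):
        nonlocal offset
        start = offset
        # Text that appears before the first child element.
        if elem.text:
            parts.append(elem.text)
            offset += len(elem.text)
        for child in elem:
            walk(child)
            # Text that follows the child, still inside `elem`.
            if child.tail:
                parts.append(child.tail)
                offset += len(child.tail)
        if elem.tag == entity_tag:
            raw_spans.append((start, offset, dict(elem.attrib)))

    walk(root)
    text = "".join(parts)
    spans = [(s, e, attrs, text[s:e]) for s, e, attrs in raw_spans]
    return text, spans
```

Usage on a made-up snippet (not real corpus data): `extract_inline_spans('<doc><p>Treat with <ne type="CM">aspirin</ne> daily.</p></doc>')` returns the plain text together with one span covering `aspirin`.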

napsternxg commented 2 years ago

Hi @jason-fries, thanks for the response. I will download the files and upload them somewhere. I will try the XML parser in Python first; if that doesn't work, I will add BeautifulSoup.

I plan to submit it early next week.

napsternxg commented 2 years ago

I downloaded the files from CVS and am uploading them here for use. We can later move them to HF datasets and update the URL in the code. PatentAnnotations_GoldStandard.tar.gz
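
A rough sketch of how a loader might read the hosted archive without any CVS dependency, using only the standard library. It assumes the archive keeps one directory per patent containing a `source.xml` file, as the `./scrapbook/WO2007000651/source.xml` path earlier in this thread suggests; the actual layout of the tarball is not confirmed here, and `iter_source_documents` is a hypothetical helper name.

```python
import tarfile
from pathlib import PurePosixPath


def iter_source_documents(archive_path, filename="source.xml"):
    """Yield (document_id, raw_bytes) for each matching file in a .tar.gz.

    The document id is taken from the parent directory name, which is an
    assumption about how the archive is organised.
    """
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            member_path = PurePosixPath(member.name)
            if member.isfile() and member_path.name == filename:
                # Read the member directly; nothing is extracted to disk.
                handle = tar.extractfile(member)
                yield member_path.parent.name, handle.read()
```

Reading members in place avoids writing the whole archive to disk; the bytes can then be passed to whichever XML parser ends up being used.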

napsternxg commented 2 years ago

Added PR: #525