bigscience-workshop / biomedical

Tools for curating biomedical training data for large-scale language modeling

Create dataset loader for CHEBI (Chapati) #113

Open leonweber opened 2 years ago

leonweber commented 2 years ago

Task: NER
License: Creative Commons
Format: custom
Language: English
Citation: ???

Referenced and used by: Habibi, Maryam, et al. "Deep learning with word embeddings improves biomedical named entity recognition." Bioinformatics.

Source: http://chebi.cvs.sourceforge.net/viewvc/chebi/chapati/

napsternxg commented 2 years ago

self-assign

hakunanatasha commented 2 years ago

Hi @napsternxg, can you let us know if you are still working on this so we can update our project board? Please notify us of the status by Friday, April 8. No worries if you are not finished but intend to keep working on this. Please either ping me here at @hakunanatasha or ping the Discord admins (with @admins).

napsternxg commented 2 years ago

Hi @hakunanatasha, yes, I plan to work on this over the weekend.

napsternxg commented 2 years ago

I have started working on this dataset. I will send a PR soon.

napsternxg commented 2 years ago

Hi @hakunanatasha and @leonweber I have a few questions on how to parse the data. Code related to my questions is in: https://colab.research.google.com/drive/1Ne8A76yn0vxwKkpU7l_OzGI968B-YieJ?usp=sharing

from bioc import biocxml

filepath = "./scrapbook/WO2007000651/source.xml"
reader = biocxml.BioCXMLDocumentReader(str(filepath))

I get the error:

AttributeError: 'BioCXMLDocumentReader' object has no attribute '_BioCXMLDocumentReader__document'

jason-fries commented 2 years ago

Hi @napsternxg Sorry about the delay in responding!

napsternxg commented 2 years ago

Hi @jason-fries, thanks for the response. I will download the files and upload them somewhere. I will try to use the XML parser in Python so that it doesn't add beautifulsoup as a dependency.
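
For reference, a minimal sketch of reading inline-annotated XML with Python's standard-library `xml.etree.ElementTree`, with no extra dependency. The schema of `source.xml` is not shown in this thread, so the `ne` tag, the `type` attribute, and the sample string below are all hypothetical placeholders; the real files may use different element names.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the real source.xml schema may differ.
SAMPLE = (
    "<doc><p>Treatment with <ne type='CHEM'>aspirin</ne> and "
    "<ne type='CHEM'>ibuprofen</ne>.</p></doc>"
)


def extract_entities(root, entity_tag="ne"):
    """Flatten an element tree to plain text and return character
    offsets for every element named `entity_tag` (an assumed tag name)."""
    parts = []  # text fragments in document order
    spans = []  # (start, end, type) for each entity element
    pos = 0     # running character offset into the flattened text

    def walk(elem):
        nonlocal pos
        start = pos
        if elem.text:
            parts.append(elem.text)
            pos += len(elem.text)
        for child in elem:
            walk(child)
            # Tail text belongs to the parent, after the child closes.
            if child.tail:
                parts.append(child.tail)
                pos += len(child.tail)
        if elem.tag == entity_tag:
            spans.append((start, pos, elem.get("type")))

    walk(root)
    text = "".join(parts)
    spans.sort(key=lambda span: span[:2])
    return text, [(s, e, t, text[s:e]) for s, e, t in spans]


text, entities = extract_entities(ET.fromstring(SAMPLE))
for start, end, label, mention in entities:
    print(start, end, label, mention)
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`. Offsets are computed against the flattened text, so `text[start:end]` always reproduces the mention.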

I plan to submit it early next week.

napsternxg commented 2 years ago

I downloaded the files from CVS and am uploading them here for use. We can later move them to HF datasets and update the URL in the code.

PatentAnnotations_GoldStandard.tar.gz
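
A rough sketch of how a loader might iterate over the documents in such an archive using only the standard-library `tarfile` module. The directory layout (`scrapbook/<document id>/source.xml`) is an assumption based on the path quoted earlier in this thread; the actual layout of the uploaded archive is not confirmed here. `iter_documents` and `build_demo_archive` are hypothetical helper names, and the demo archive is built in memory so the example runs without the real file.

```python
import io
import tarfile


def iter_documents(tar, filename="source.xml"):
    """Yield (document_id, raw_bytes) for each member named `filename`,
    taking the parent directory name as the document ID (assumed layout)."""
    for member in tar.getmembers():
        if not member.isfile():
            continue
        parts = member.name.split("/")
        if len(parts) < 2 or parts[-1] != filename:
            continue
        yield parts[-2], tar.extractfile(member).read()


def build_demo_archive():
    """Build a tiny in-memory .tar.gz that mimics the assumed layout."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for doc_id in ("WO2007000651", "DEMO0000001"):
            data = f"<doc id='{doc_id}'/>".encode("utf-8")
            info = tarfile.TarInfo(name=f"scrapbook/{doc_id}/source.xml")
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=build_demo_archive(), mode="r:gz") as tar:
    docs = dict(iter_documents(tar))

print(sorted(docs))
```

With the real archive on disk, `tarfile.open("PatentAnnotations_GoldStandard.tar.gz", mode="r:gz")` would replace the in-memory demo.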

napsternxg commented 2 years ago

Added PR: #525