Open leonweber opened 2 years ago
Hi @napsternxg, can you let us know if you are still working on this so we can update our project board? Please let us know the status by Friday, April 8. No worries if you are not finished but still intend to work on this. Please either ping me here at @hakunanatasha or ping the Discord admins (with @admins).
Hi @hakunanatasha, yes, I plan to work on this over the weekend.
I have started working on this dataset. I will send a PR soon.
Hi @hakunanatasha and @leonweber I have a few questions on how to parse the data. Code related to my questions is in: https://colab.research.google.com/drive/1Ne8A76yn0vxwKkpU7l_OzGI968B-YieJ?usp=sharing
from bioc import biocxml  # assumed import; the snippet as posted omits it
filepath = "./scrapbook/WO2007000651/source.xml"
reader = biocxml.BioCXMLDocumentReader(str(filepath))
I get the error:
AttributeError: 'BioCXMLDocumentReader' object has no attribute '_BioCXMLDocumentReader__document'
Hi @napsternxg, sorry about the delay in responding!
BioCXMLDocumentReader assumes you are using a BioC-formatted file, so it won't work (as far as I know) with standard or nonstandard XML files. The XML package available by default in Python might work here. If not, go ahead and use BeautifulSoup and we can discuss adding it to our supported packages.
Hi @jason-fries, thanks for the response. I will download the files and upload them somewhere. I will try to use Python's built-in XML parser so that we don't need to add BeautifulSoup.
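A minimal sketch of the standard-library route suggested above, assuming the patent files are plain (non-BioC) XML. The sample document, its tag names (`title`, `abstract`, `ne`), and the helper `iter_text_elements` are hypothetical and are not taken from the dataset; the real `source.xml` layout is not shown in this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the actual dataset files may use
# entirely different tags and attributes.
SAMPLE_XML = """<document id="EXAMPLE-1">
  <title>Example title</title>
  <abstract>Example abstract mentioning <ne type="chemical">aspirin</ne>.</abstract>
</document>"""


def iter_text_elements(root):
    """Yield (tag, full_text) for each direct child of the root element."""
    for child in root:
        # itertext() walks the element and its descendants, so text inside
        # nested annotation tags is included in the joined string.
        yield child.tag, "".join(child.itertext()).strip()


# For a file on disk, ET.parse(path).getroot() returns the same kind of object.
root = ET.fromstring(SAMPLE_XML)

sections = list(iter_text_elements(root))
# Collect (type attribute, surface text) for every hypothetical <ne> element.
entities = [(el.get("type"), el.text) for el in root.iter("ne")]

print(sections)
print(entities)
```

If the real files turn out to be malformed XML, `ET.fromstring` raises `xml.etree.ElementTree.ParseError`, which would be the point at which a more lenient parser such as BeautifulSoup becomes worth discussing.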
I plan to submit it early next week.
I downloaded the files from CVS and am uploading them here for use. We can later move them to HF datasets and update the URL in the code. PatentAnnotations_GoldStandard.tar.gz
Added PR: #525
Task: NER
License: Creative Commons
Format: custom
Language: English
Citation: ???
Referenced and used by Habibi, Maryam, et al. "Deep learning with word embeddings improves biomedical named entity recognition." Bioinformatics.
Source: http://chebi.cvs.sourceforge.net/viewvc/chebi/chapati/