dmis-lab / BERN2

BERN2: an advanced neural biomedical named entity recognition and normalization tool
http://bern2.korea.ac.kr
BSD 2-Clause "Simplified" License

Where do raw data files for Cell Ontology and Cellosaurus come from #7

cthoyt closed this issue 2 years ago

cthoyt commented 2 years ago

The Cell Ontology is currently on version 2022-01-05 and Cellosaurus is currently on version 40.0 (2021-12-16). I can see in the preprocess directory that there are hard-coded paths to older versions of these resources:

https://github.com/dmis-lab/BERN2/blob/a16b9c7b5f2e753ef5d1f159192a7d0d11f73bd7/preprocess/preprocess_cellontology.py#L3

https://github.com/dmis-lab/BERN2/blob/a16b9c7b5f2e753ef5d1f159192a7d0d11f73bd7/preprocess/preprocess_cellosaurus.py#L1

I'd like to process the newer versions of these ontologies, but it's not obvious how these input files were created. Could you please explain how the CSV file (Cell Ontology) and the txt file (Cellosaurus) were produced?

If I had to guess, I'd say that the Cellosaurus file comes from https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt and the Cell Ontology one comes from the CSV export listed on the BioPortal page for CL. Is that correct?

mjeensung commented 2 years ago

Hi @cthoyt!

Yes, that is correct.

The Cellosaurus file came from https://ftp.expasy.org/databases/cellosaurus/cellosaurus.txt, and the Cell Ontology one came from https://bioportal.bioontology.org/ontologies/CL.

You can also find the raw dictionary files we are using in resources/normalization/rawdata.
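
For anyone who wants to inspect a newer Cellosaurus release before running it through the preprocessing step, here is a minimal sketch of reading the flat-file format. It is not the code in preprocess/preprocess_cellosaurus.py. It assumes the usual layout of cellosaurus.txt: a two-letter line code followed by three spaces, `ID` for the cell line name, `AC` for the accession, `SY` for semicolon-separated synonyms, and `//` ending each entry. The function name `parse_cellosaurus` and the `SAMPLE` lines are hypothetical and only serve as illustration.

```python
from typing import Dict, Iterable, Iterator

# Illustrative input only: two short entries in the assumed flat-file layout.
SAMPLE = [
    "ID   HeLa\n",
    "AC   CVCL_0030\n",
    "SY   HELA; Hela\n",
    "//\n",
    "ID   HEK293\n",
    "AC   CVCL_0045\n",
    "//\n",
]


def parse_cellosaurus(lines: Iterable[str]) -> Iterator[Dict[str, object]]:
    """Yield one dict per entry with its name, accession, and synonyms."""
    record: Dict[str, object] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("//"):
            # End of an entry. Keep it only if it has a CVCL_ accession,
            # which skips any header text that happens to look like an entry.
            if str(record.get("accession", "")).startswith("CVCL_"):
                yield record
            record = {}
            continue
        # Split the line code from its content at the three-space separator.
        code, _, value = line.partition("   ")
        value = value.strip()
        if code == "ID":
            record = {"name": value, "accession": "", "synonyms": []}
        elif code == "AC" and record:
            record["accession"] = value
        elif code == "SY" and record:
            record["synonyms"] = [s.strip() for s in value.split(";") if s.strip()]


records = list(parse_cellosaurus(SAMPLE))
```

To try it on a real download, pass an open file handle instead of `SAMPLE`, for example `parse_cellosaurus(open("cellosaurus.txt", encoding="utf-8"))`. The local file name here is an assumption.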