HermannKroll / NarrativeIntelligence

GNU General Public License v3.0

Narrative Service: Add vocabulary for cell lines #288

Open HermannKroll opened 5 months ago

HermannKroll commented 5 months ago

Find the vocabulary that the NLM uses for PubTator. Use that vocabulary to translate CellLines in our services and to make CellLines searchable.

Maybe we also need to integrate a way to annotate CellLines on our own. PubTator uses TaggerOne (maybe there is a new version).

ir0ntr0nik commented 2 months ago

PubTator3 uses Cellosaurus as terminology for Cell Line annotations. The respective vocabulary can be found at https://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml.

  1. The size of the XML file is ~500MB
  2. The size of the preprocessed vocabulary is ~6.5MB
  3. The vocabulary contains roughly 150k entities

The implementation is ready.
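
The preprocessing code itself is not shown in this thread. As a rough illustration only, reducing the large XML dump to a compact accession-to-names vocabulary could be sketched with a streaming parser as below. The element names (`cell-line`, `accession-list/accession` with `type="primary"`, `name-list/name`) are assumptions about the Cellosaurus XML layout, and the function name `extract_cell_line_names` and the sample record are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import defaultdict


def extract_cell_line_names(xml_source):
    """Stream a Cellosaurus-style XML source and map accession -> set of names.

    Assumed layout (not confirmed in this thread): each <cell-line> element
    holds <accession-list>/<accession type="primary"> and <name-list>/<name>.
    """
    vocab = defaultdict(set)
    # iterparse reads the file incrementally, so a ~500MB file is never
    # parsed into one in-memory string.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "cell-line":
            continue
        accession = None
        for acc in elem.findall("accession-list/accession"):
            if acc.get("type") == "primary":
                accession = (acc.text or "").strip()
                break
        if accession:
            for name in elem.findall("name-list/name"):
                text = (name.text or "").strip()
                if text:
                    vocab[accession].add(text)
        elem.clear()  # drop the children of the processed record
    return dict(vocab)


# Made-up sample record in the assumed layout, for illustration only.
SAMPLE = b"""<Cellosaurus><cell-line-list>
<cell-line>
  <accession-list><accession type="primary">CVCL_0023</accession></accession-list>
  <name-list>
    <name type="identifier">A-549</name>
    <name type="synonym">A549</name>
  </name-list>
</cell-line>
</cell-line-list></Cellosaurus>"""

vocab = extract_cell_line_names(io.BytesIO(SAMPLE))
```

For the real file, a path to the downloaded `cellosaurus.xml` can be passed instead of the in-memory sample.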

HermannKroll commented 4 days ago

Can we run the detection on the PubMed collection and see the most frequently tagged entities? This way, we can check how many wrong or misleading tags we have and how good the overall vocabulary is.

HermannKroll commented 4 days ago

SELECT ent_id, ent_str, COUNT(*)
FROM TAG
WHERE ent_type = 'CellLine'
GROUP BY ent_id, ent_str
ORDER BY COUNT(*) DESC

ir0ntr0nik commented 4 days ago

https://github.com/user-attachments/files/17816659/data-1732026924688.csv

HermannKroll commented 3 days ago

"ent_id" "ent_str" "count"
"CVCL:0023" "A549" 20854
"CVCL:0027" "HepG2" 20308
"CVCL:0031" "MCF-7" 19789
"CVCL:0030" "HeLa" 15632
"CVCL:0062" "MDA-MB-231" 14535
"CVCL:0025" "Caco-2" 7718
"CVCL:U508" "MN" 7472
"CVCL:0493" "RAW264.7" 7026
"CVCL:0019" "SH-SY5Y" 6931
"CVCL:0291" "HCT116" 6654
"CVCL:0481" "PC12" 6286
"CVCL:0045" "HEK293" 5419
"CVCL:0493" "RAW 264.7" 5332
"CVCL:0031" "MCF7" 5200
"CVCL:0004" "K562" 5072
"CVCL:0006" "THP-1" 4773
"CVCL:0395" "LNCaP" 4690
"CVCL:2492" "HL" 4632
"CVCL:0038" "HaCaT" 4486
"CVCL:0213" "CHO" 4303
"CVCL:0320" "HT-29" 4253
"CVCL:X218" "CMT" 3979
"CVCL:0286" "H9c2" 3473
"CVCL:0002" "HL-60" 3037
"CVCL:0065" "Jurkat" 2940
"CVCL:0409" "MC3T3-E1" 2872
"CVCL:0291" "HCT-116" 2693
"CVCL:0302" "HK-2" 2675
"CVCL:0105" "DU145" 2603
"CVCL:0182" "BV2" 2574
"CVCL:0546" "SW480" 2527
"CVCL:0063" "HEK293T" 2516
"CVCL:0007" "U937" 2437
"CVCL:0035" "PC-3" 2412
"CVCL:0188" "C2C12" 2397
"CVCL:0422" "MDCK" 2371
"CVCL:9115" "MEFs" 2337
"CVCL:0532" "SKOV3" 2313
"CVCL:0159" "B16F10" 2211
"CVCL:0145" "ARPE-19" 2165
"CVCL:0168" "BEAS-2B" 2140
"CVCL:0440" "MRC" 2036
"CVCL:0060" "H1299" 1991
"CVCL:0132" "A375" 1910
"CVCL:0480" "PANC-1" 1791
"CVCL:0462" "L929" 1783
"CVCL:0125" "4T1" 1780
"CVCL:0321" "HT22" 1764
"CVCL:0286" "H9C2" 1718
"CVCL:7254" "CT26" 1661
"CVCL:0021" "U251" 1652
"CVCL:0182" "BV-2" 1642
"CVCL:0022" "U87MG" 1604
"CVCL:0037" "A431" 1536
"CVCL:4140" "CAR-T" 1513
"CVCL:0320" "HT29" 1488
"CVCL:0035" "PC3" 1481
"CVCL:0042" "U2OS" 1466
"CVCL:E778" "MCF" 1466
"CVCL:0419" "MDA-MB-468" 1445
"CVCL:0045" "HEK-293" 1438
"CVCL:0326" "Hep3B" 1418
"CVCL:0547" "SW620" 1402
"CVCL:0426" "MG-63" 1334
"CVCL:0553" "T47D" 1310
"CVCL:0032" "SiHa" 1271
"CVCL:0063" "293T" 1257
"CVCL:0594" "NIH3T3" 1225
"CVCL:0336" "Huh7" 1106
"CVCL:0022" "U87" 1088
"CVCL:2246" "IPEC-J2" 1059
"CVCL:N540" "B16" 1056
"CVCL:0002" "HL60" 1044
"CVCL:0598" "MCF10A" 1040
"CVCL:0440" "MRC-5" 1020
"CVCL:5792" "LX-2" 1008
"CVCL:0186" "BxPC-3" 999
"CVCL:2959" "HUVEC" 988
"CVCL:0598" "MCF-10A" 977
"CVCL:1906" "HEp-2" 960
"CVCL:0520" "SGC-7901" 923
"CVCL:0532" "SKOV-3" 912
"CVCL:0152" "PC" 903
"CVCL:0134" "A2780" 897
"CVCL:1E42" "HO-1" 869
"CVCL:L894" "OSCC" 860
"CVCL:0534" "SMMC-7721" 859
"CVCL:7028" "SY" 858
"CVCL:M624" "HEK" 852
"CVCL:1511" "H1975" 852
"CVCL:0303" "HL-1" 831
"CVCL:0105" "DU-145" 822
"CVCL:7039" "EJ" 812
"CVCL:0459" "H460" 805
"CVCL:3285" "HFF" 802
"CVCL:0511" "Raji" 799
"CVCL:0431" "MIN6" 799
"CVCL:X905" "C6" 764
"CVCL:0399" "LoVo" 762
"CVCL:0426" "MG63" 751
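
Because the query groups by both ent_id and ent_str, spelling variants of one cell line (for example "MCF-7" and "MCF7") appear as separate rows. A minimal sketch of how the list could be condensed for review is given below: it sums counts per ent_id and lists very short strings as candidates for a manual check. The helper names and the length threshold of 3 are hypothetical and not something used in this thread, and a short string is not necessarily a wrong tag.

```python
from collections import defaultdict

# A few (ent_id, ent_str, count) rows copied from the result list above.
ROWS = [
    ("CVCL:0031", "MCF-7", 19789),
    ("CVCL:0031", "MCF7", 5200),
    ("CVCL:U508", "MN", 7472),
    ("CVCL:2492", "HL", 4632),
    ("CVCL:0493", "RAW264.7", 7026),
    ("CVCL:0493", "RAW 264.7", 5332),
]


def merge_by_id(rows):
    """Sum the tag counts of all surface forms that share one ent_id."""
    totals = defaultdict(int)
    for ent_id, _ent_str, count in rows:
        totals[ent_id] += count
    return dict(totals)


def short_surface_forms(rows, max_len=3):
    """Return surface forms with at most max_len characters.

    Hypothetical heuristic: short strings are more likely to collide with
    ordinary abbreviations, so they are worth checking by hand.
    """
    return [ent_str for _ent_id, ent_str, _count in rows if len(ent_str) <= max_len]


totals = merge_by_id(ROWS)
review_candidates = short_surface_forms(ROWS)
```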

HermannKroll commented 1 day ago

Can be integrated.