Open HermannKroll opened 5 months ago
PubTator3 uses Cellosaurus as its terminology for cell line annotations. The vocabulary is available at https://ftp.expasy.org/databases/cellosaurus/cellosaurus.xml.
The implementation is ready.
Can we run the detection on the PubMed collection and look at the most frequently tagged entities? That way we can check how many wrong or misleading tags we get and how good the vocabulary is overall.
SELECT ent_id, ent_str, COUNT(*)
FROM TAG
WHERE ent_type = 'CellLine'
GROUP BY ent_id, ent_str
ORDER BY COUNT(*) DESC
"CVCL:0023" "A549" 20854 "CVCL:0027" "HepG2" 20308 "CVCL:0031" "MCF-7" 19789 "CVCL:0030" "HeLa" 15632 "CVCL:0062" "MDA-MB-231" 14535 "CVCL:0025" "Caco-2" 7718 "CVCL:U508" "MN" 7472 "CVCL:0493" "RAW264.7" 7026 "CVCL:0019" "SH-SY5Y" 6931 "CVCL:0291" "HCT116" 6654 "CVCL:0481" "PC12" 6286 "CVCL:0045" "HEK293" 5419 "CVCL:0493" "RAW 264.7" 5332 "CVCL:0031" "MCF7" 5200 "CVCL:0004" "K562" 5072 "CVCL:0006" "THP-1" 4773 "CVCL:0395" "LNCaP" 4690 "CVCL:2492" "HL" 4632 "CVCL:0038" "HaCaT" 4486 "CVCL:0213" "CHO" 4303 "CVCL:0320" "HT-29" 4253 "CVCL:X218" "CMT" 3979 "CVCL:0286" "H9c2" 3473 "CVCL:0002" "HL-60" 3037 "CVCL:0065" "Jurkat" 2940 "CVCL:0409" "MC3T3-E1" 2872 "CVCL:0291" "HCT-116" 2693 "CVCL:0302" "HK-2" 2675 "CVCL:0105" "DU145" 2603 "CVCL:0182" "BV2" 2574 "CVCL:0546" "SW480" 2527 "CVCL:0063" "HEK293T" 2516 "CVCL:0007" "U937" 2437 "CVCL:0035" "PC-3" 2412 "CVCL:0188" "C2C12" 2397 "CVCL:0422" "MDCK" 2371 "CVCL:9115" "MEFs" 2337 "CVCL:0532" "SKOV3" 2313 "CVCL:0159" "B16F10" 2211 "CVCL:0145" "ARPE-19" 2165 "CVCL:0168" "BEAS-2B" 2140 "CVCL:0440" "MRC" 2036 "CVCL:0060" "H1299" 1991 "CVCL:0132" "A375" 1910 "CVCL:0480" "PANC-1" 1791 "CVCL:0462" "L929" 1783 "CVCL:0125" "4T1" 1780 "CVCL:0321" "HT22" 1764 "CVCL:0286" "H9C2" 1718 "CVCL:7254" "CT26" 1661 "CVCL:0021" "U251" 1652 "CVCL:0182" "BV-2" 1642 "CVCL:0022" "U87MG" 1604 "CVCL:0037" "A431" 1536 "CVCL:4140" "CAR-T" 1513 "CVCL:0320" "HT29" 1488 "CVCL:0035" "PC3" 1481 "CVCL:0042" "U2OS" 1466 "CVCL:E778" "MCF" 1466 "CVCL:0419" "MDA-MB-468" 1445 "CVCL:0045" "HEK-293" 1438 "CVCL:0326" "Hep3B" 1418 "CVCL:0547" "SW620" 1402 "CVCL:0426" "MG-63" 1334 "CVCL:0553" "T47D" 1310 "CVCL:0032" "SiHa" 1271 "CVCL:0063" "293T" 1257 "CVCL:0594" "NIH3T3" 1225 "CVCL:0336" "Huh7" 1106 "CVCL:0022" "U87" 1088 "CVCL:2246" "IPEC-J2" 1059 "CVCL:N540" "B16" 1056 "CVCL:0002" "HL60" 1044 "CVCL:0598" "MCF10A" 1040 "CVCL:0440" "MRC-5" 1020 "CVCL:5792" "LX-2" 1008 "CVCL:0186" "BxPC-3" 999 "CVCL:2959" "HUVEC" 988 "CVCL:0598" "MCF-10A" 977 "CVCL:1906" "HEp-2" 960 
"CVCL:0520" "SGC-7901" 923 "CVCL:0532" "SKOV-3" 912 "CVCL:0152" "PC" 903 "CVCL:0134" "A2780" 897 "CVCL:1E42" "HO-1" 869 "CVCL:L894" "OSCC" 860 "CVCL:0534" "SMMC-7721" 859 "CVCL:7028" "SY" 858 "CVCL:M624" "HEK" 852 "CVCL:1511" "H1975" 852 "CVCL:0303" "HL-1" 831 "CVCL:0105" "DU-145" 822 "CVCL:7039" "EJ" 812 "CVCL:0459" "H460" 805 "CVCL:3285" "HFF" 802 "CVCL:0511" "Raji" 799 "CVCL:0431" "MIN6" 799 "CVCL:X905" "C6" 764 "CVCL:0399" "LoVo" 762 "CVCL:0426" "MG63" 751
Can be integrated.
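To make reviewing the counts listed above less manual, here is a minimal Python sketch of two checks one might run on the query result. It merges spelling variants that share an `ent_id` (for example `RAW264.7` and `RAW 264.7`) and flags very short, purely alphabetic strings such as `MN` or `HL` as candidates for manual review. The function names `totals_per_id` and `short_tag_candidates` and the length threshold are assumptions for illustration, not part of the existing implementation, and a flagged string is not necessarily a wrong tag.

```python
from collections import defaultdict

# A few (ent_id, ent_str, count) rows copied from the query result above.
ROWS = [
    ("CVCL:0493", "RAW264.7", 7026),
    ("CVCL:0493", "RAW 264.7", 5332),
    ("CVCL:0031", "MCF-7", 19789),
    ("CVCL:0031", "MCF7", 5200),
    ("CVCL:U508", "MN", 7472),
    ("CVCL:2492", "HL", 4632),
    ("CVCL:0152", "PC", 903),
]


def totals_per_id(rows):
    """Sum the counts of all surface forms that map to the same ent_id."""
    totals = defaultdict(int)
    for ent_id, _ent_str, count in rows:
        totals[ent_id] += count
    return dict(totals)


def short_tag_candidates(rows, max_len=2):
    """Heuristic (assumption): short alphabetic strings are often generic
    abbreviations, so list them for manual inspection."""
    return [row for row in rows if len(row[1]) <= max_len and row[1].isalpha()]


if __name__ == "__main__":
    for ent_id, total in sorted(totals_per_id(ROWS).items(), key=lambda kv: -kv[1]):
        print(ent_id, total)
    for ent_id, ent_str, count in short_tag_candidates(ROWS):
        print("review:", ent_id, ent_str, count)
```

Raising `max_len` to 3 would also surface strings like `CHO`, `CMT`, or `MRC` on the full result, at the cost of more candidates to inspect by hand.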
Find the vocabulary that the NLM uses for PubTator. Use that vocabulary to translate CellLines in our services and to make CellLines searchable.
We may also need a way to annotate CellLines ourselves. PubTator uses TaggerOne (there may be a newer version).
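As a rough illustration of the translation step, the sketch below builds an id-to-name lookup from Cellosaurus-style XML with Python's standard `xml.etree.ElementTree`. The element names (`cell-line`, `accession-list`, `name-list`) and the `type` attribute values are assumptions about the file layout and should be checked against the actual `cellosaurus.xml`. The sample document is hand-written, not copied from the file. `build_lookup` and `translate` are hypothetical helper names. The sketch also assumes the XML writes accessions with an underscore (`CVCL_0023`) while our tags use a colon (`CVCL:0023`).

```python
import xml.etree.ElementTree as ET

# Hand-written sample; the structure is an assumption about the real file.
SAMPLE_XML = """<Cellosaurus>
  <cell-line-list>
    <cell-line>
      <accession-list>
        <accession type="primary">CVCL_0023</accession>
      </accession-list>
      <name-list>
        <name type="identifier">A-549</name>
        <name type="synonym">A549</name>
      </name-list>
    </cell-line>
  </cell-line-list>
</Cellosaurus>"""


def build_lookup(xml_text):
    """Map tag-style ids (CVCL:xxxx) to the preferred cell line name."""
    root = ET.fromstring(xml_text)
    lookup = {}
    for cell_line in root.iter("cell-line"):
        accession = cell_line.find("accession-list/accession[@type='primary']")
        name = cell_line.find("name-list/name[@type='identifier']")
        if accession is None or name is None:
            continue  # skip entries that do not match the assumed layout
        # Normalize CVCL_0023 to CVCL:0023 so it matches ent_id in TAG.
        ent_id = accession.text.strip().replace("_", ":")
        lookup[ent_id] = name.text.strip()
    return lookup


def translate(ent_id, lookup):
    """Return the preferred name, or the id itself if it is unknown."""
    return lookup.get(ent_id, ent_id)


if __name__ == "__main__":
    lookup = build_lookup(SAMPLE_XML)
    print(translate("CVCL:0023", lookup))
```

For the full vocabulary file, a streaming parser such as `ET.iterparse` would likely be preferable to loading the whole document into memory at once.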