Knowledge-Graph-Hub / kg-microbe

https://knowledge-graph-hub.github.io/kg-microbe/index.html
BSD 3-Clause "New" or "Revised" License

filter out viruses from NCBI Taxonomy 'unclassified' #126

Open realmarcin opened 4 months ago

realmarcin commented 4 months ago

These viruses are found in the ncbitaxon_removed_subset.json:

"val" : "Cotton leaf curl Rajasthan virus betasatellite defective interfering DNA"
"lbl" : "Cotton leaf curl Rajasthan virus defective interfering DNA",
"lbl" : "Cotton leaf curl virus betasatellite defective interfering DNA",
"lbl" : "Hygrophorus parvirussula",
"lbl" : "unidentified Cotton leaf curl Rajasthan virus-associated DNA",

realmarcin commented 4 months ago

@bsantan let's think about how to add this to the transform code. I believe it will work to keep only the entries whose reference proteome names contain neither 'virus' nor 'phage'. We can assume that no multicellular organism is 'unclassified', though this may not be entirely true. A first-pass test transform could simply exclude everything under 'unclassified'.
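
A minimal sketch of the name filter proposed above, using the labels quoted earlier in this issue as test input. The function names `looks_viral` and `filter_non_viral` and the `whole_word` option are hypothetical and are not part of the kg-microbe transform code. Note that a plain substring test also drops "Hygrophorus parvirussula", which contains "virus" inside a longer word and appears to be a fungal species name rather than a virus, so the sketch includes an optional stricter match on words that end in 'virus' or 'phage'.

```python
import re

# Hypothetical helpers for illustration only; not part of kg-microbe.
VIRAL_TERMS = ("virus", "phage")

# Matches whole words ending in "virus"/"viruses" or "phage"/"phages",
# e.g. "virus", "coronavirus", "bacteriophage", but not "parvirussula".
VIRAL_WORD = re.compile(r"\b\w*(?:virus(?:es)?|phages?)\b", re.IGNORECASE)


def looks_viral(label: str, whole_word: bool = False) -> bool:
    """Return True if the label looks like a virus or phage name."""
    if whole_word:
        return VIRAL_WORD.search(label) is not None
    lowered = label.lower()
    return any(term in lowered for term in VIRAL_TERMS)


def filter_non_viral(labels, whole_word: bool = False):
    """Keep only the labels that do not look viral."""
    return [label for label in labels if not looks_viral(label, whole_word)]


# Labels quoted in this issue.
labels = [
    "Cotton leaf curl Rajasthan virus defective interfering DNA",
    "Cotton leaf curl virus betasatellite defective interfering DNA",
    "Hygrophorus parvirussula",
    "unidentified Cotton leaf curl Rajasthan virus-associated DNA",
]

print(filter_non_viral(labels))                   # []
print(filter_non_viral(labels, whole_word=True))  # ['Hygrophorus parvirussula']
```

Whether the substring or the whole-word variant is the better fit depends on how the reference proteome names are actually formatted, which this sketch does not assume.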