VDBWRAIR / pathdiscov

Pathogen Discover Pipeline

Smarter top.blast selection? #275

Open averagehat opened 9 years ago

averagehat commented 9 years ago

@demis001 @JunHang Currently the pipeline only takes the very first top result from each conflict as a "top" result to annotate. Would it be of any value to you to select more results (above some e-value threshold)?

JunHang commented 9 years ago

Classification: UNCLASSIFIED Caveats: NONE

Hi Mike,

Your question is very interesting and relevant; however, the problem may be difficult to fix. The cause is that many entries in the NCBI database have insufficient or even wrong classification. This has become more problematic with metagenomic sequencing. For example, we are doing arbovirus discovery from mosquitoes, and the top BLAST hit is a 'mosquito virus' with taxonomy 'unclassified', which doesn't help me understand which virus it is. It is even possible that all of the top 10 hits are unclassified...

The short answer for you to consider is: the additional hits to include are the ones with the lowest e-value (not necessarily above an e-value threshold) AND classified (Family or Genus).

Jun
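The selection rule described above (keep the top hit, then add the lowest e-value hits that carry a Family or Genus) could be sketched roughly as follows. This is only an illustration, not the pipeline's actual code: the `Hit` record, the `select_hits` function, the `extra` parameter, and the set of labels treated as "unclassified" are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit; not the pipeline's real format."""
    accession: str
    evalue: float
    family: str
    genus: str


# Assumption: these labels mean "no usable classification at this rank".
UNINFORMATIVE = {"", "unclassified", "na", "n/a"}


def is_classified(hit: Hit) -> bool:
    """True if the hit has an informative Family or Genus."""
    ranks = (hit.family, hit.genus)
    return any(r.strip().lower() not in UNINFORMATIVE for r in ranks)


def select_hits(hits: List[Hit], extra: int = 3) -> List[Hit]:
    """Return the best hit plus up to `extra` classified hits.

    Hits are ranked by ascending e-value. The best hit is always kept,
    even if unclassified; additional hits must be classified.
    """
    if not hits:
        return []
    ranked = sorted(hits, key=lambda h: h.evalue)  # stable sort keeps ties in input order
    top = ranked[0]
    classified = [h for h in ranked[1:] if is_classified(h)]
    return [top] + classified[:extra]


# Made-up example data: two unclassified hits followed by two classified ones.
example_hits = [
    Hit("ACC1", 1e-50, "unclassified", ""),
    Hit("ACC2", 1e-48, "", "unclassified"),
    Hit("ACC3", 1e-40, "Flaviviridae", "Flavivirus"),
    Hit("ACC4", 1e-30, "Peribunyaviridae", ""),
]
selected = select_hits(example_hits, extra=1)
```

With the made-up data above, `selected` holds the unclassified top hit `ACC1` and the first classified hit `ACC3`. The sketch does not settle whether an e-value cutoff should also apply to the extra hits.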

-----Original Message----- From: Mike Panciera [mailto:notifications@github.com] Sent: Friday, July 24, 2015 4:13 PM To: VDBWRAIR/pathdiscov Cc: Hang, Jun CIV USARMY MEDCOM WRAIR (US) Subject: [pathdiscov] Smarter top.blast selection? (#275)


averagehat commented 9 years ago

I will be looking into this more. @InaMBerry suggested that if any of the top 5-10 BLAST hits has a different taxonomy than the first BLAST hit, this should be flagged in the summary. The idea is to point out possibly mislabeled sequences in GenBank.
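A minimal sketch of that flagging idea, assuming each hit has already been reduced to a single taxonomy label (for example a family name) and that the labels are ordered best hit first. The function name `flag_taxonomy_mismatch` and the `window` parameter are hypothetical and do not exist in the pipeline.

```python
from typing import List, Tuple


def flag_taxonomy_mismatch(taxa: List[str], window: int = 10) -> List[Tuple[int, str]]:
    """Report hits whose taxonomy label differs from the top hit's label.

    `taxa` holds one label per hit, best hit first. Only the first
    `window` hits are inspected. Returns (rank, label) pairs with
    1-based ranks, so an empty list means the top hits all agree.
    """
    if not taxa:
        return []
    top_label = taxa[0]
    return [
        (rank, label)
        for rank, label in enumerate(taxa[:window], start=1)
        if label != top_label
    ]


# Made-up labels: the third hit disagrees with the top hit.
labels = ["Flaviviridae", "Flaviviridae", "Togaviridae", "Flaviviridae"]
mismatches = flag_taxonomy_mismatch(labels, window=5)
```

Here `mismatches` contains the single pair `(3, "Togaviridae")`, which is the kind of entry the summary could surface for manual review. Comparing plain strings is a simplification: unclassified labels would need separate handling.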

There was also some discussion about creating a phylogenetic tree from these top hits and the query sequence. The tree could be rendered and analyzed by the user and/or suspicious placement automatically reported by the pipeline.

Edit: One could also do the phylogenetic analysis with other organisms within the taxonomy fetched from blast.