CPTR-ReSeqTB / UVP

Mycobacterium tuberculosis next generation sequence analysis
MIT License

"Not species specific" when sample is Mycobacterium Tuberculosis #15

Closed matnguyen closed 5 years ago

matnguyen commented 5 years ago

During the species-specificity check, valid samples can be discarded because Kraken may classify reads only to the genus Mycobacterium rather than to Mycobacterium tuberculosis, while the code (below) only counts rows that match Mycobacterium tuberculosis.

def runKraken(self):
    ...
    for lines in fh1:
        fields = lines.rstrip("\r\n").split("\t")
        if fields[5].find("Mycobacterium tuberculosis") != -1:
            cov += float(fields[0])
    fh1.close()
    if cov < 90:
        self.__CallCommand('mv', ['mv', self.fOut, self.flog])
        #self.__CallCommand('rm', ['rm', self.kraken + "/kraken.txt"])
        self.__logFH.write("not species specific\n")
        i = datetime.now()
        self.__logFH2.write(i.strftime('%Y/%m/%d %H:%M:%S') + "\t" + "Input:" + "\t" + self.input + "\t" + "not species specific\n")
        sys.exit(2)

The final_report.txt for Kraken contains:

  0.01  407 407 U   0   unclassified
 99.99  3225236 0   -   1   root
 99.99  3225235 3   -   131567    cellular organisms
 99.99  3225231 114 D   2       Bacteria
 99.98  3224864 9   -   1783272       Terrabacteria group
 99.92  3223162 2   P   201174          Actinobacteria
 99.92  3223160 5   C   1760              Actinobacteria
 99.92  3223117 21  O   85007               Corynebacteriales
 99.92  3223094 87  F   1762                  Mycobacteriaceae
 99.92  3223004 2709350 G   1763                    Mycobacterium
 15.83  510550  287089  -   77643                     Mycobacterium tuberculosis complex
  6.76  218100  200172  S   1773                        Mycobacterium tuberculosis
  0.44  14259   14259   -   1334058                       Mycobacterium tuberculosis TRS12

For reference, here is an accession that I have tested:

mezewudo commented 5 years ago

You are right. I edited the line to check for 'Mycobacterium tuberculosis complex'.

matnguyen commented 5 years ago

Since Kraken can sometimes be too general in its classification (Mycobacterium instead of Mycobacterium tuberculosis), would changing that line to accept "Mycobacterium" work better since then viable samples would not be discarded by UVP?

mezewudo commented 5 years ago

I still want to discriminate against Mycobacterium that is not in the Mycobacterium tuberculosis complex.
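
One way to keep that discrimination without relying on name matching is to read the clade percentage from the single report row for the Mycobacterium tuberculosis complex (taxid 77643 in the report above). This is only a minimal sketch, not UVP's code: the helper name `mtbc_clade_percent` is hypothetical, and it assumes the usual six-column, tab-separated Kraken report in which the first column is the percentage of reads in the clade rooted at that taxon.

```python
MTBC_TAXID = "77643"  # Mycobacterium tuberculosis complex, as listed in the report above


def mtbc_clade_percent(report_lines):
    """Return the clade percentage of the MTBC row, or 0.0 if the row is absent.

    Hypothetical helper. Assumed columns (tab-separated):
    percent, clade reads, direct reads, rank code, taxid, indented name.
    """
    for line in report_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        if fields[4].strip() == MTBC_TAXID:
            return float(fields[0])
    return 0.0


# Rows copied from the report in this issue; tab delimiters are assumed.
sample = [
    " 99.92\t3223004\t2709350\tG\t1763\t                    Mycobacterium",
    " 15.83\t510550\t287089\t-\t77643\t                      Mycobacterium tuberculosis complex",
    "  6.76\t218100\t200172\tS\t1773\t                        Mycobacterium tuberculosis",
]

print(mtbc_clade_percent(sample))
```

Because the clade percentage of a row already includes all of its descendant taxa, no summation is needed, and reads assigned only to the genus Mycobacterium are still excluded.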


pvishwa2 commented 5 years ago

Actually, wouldn't the find("Mycobacterium tuberculosis") match the "Mycobacterium tuberculosis complex" in addition to all subsequent "Mycobacterium tuberculosis" containing lines? The issue (I think) may lie in the Kraken database used. I've compared Kraken results from the Galaxy server (using the bacteria database) and from a local machine (using the standard database) that show the same kind of result that matnguyen got. The Galaxy results showed MTBC cov values at > 90, while the local versions topped out at roughly 25. Should I be looking at using the Kraken bacteria database instead?
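
The substring behavior described above can be checked directly. The sketch below mirrors the summation loop quoted from `runKraken` on four rows of the report posted earlier in this thread. The helper name `substring_cov` is hypothetical, and tab delimiters are assumed.

```python
def substring_cov(report_lines, needle="Mycobacterium tuberculosis"):
    """Sum column 0 over every row whose name contains `needle`.

    Hypothetical helper that mirrors the loop quoted from runKraken.
    """
    cov = 0.0
    for line in report_lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) >= 6 and fields[5].find(needle) != -1:
            cov += float(fields[0])
    return cov


# Rows copied from the report in this issue; tab delimiters are assumed.
sample = [
    " 99.92\t3223004\t2709350\tG\t1763\t                    Mycobacterium",
    " 15.83\t510550\t287089\t-\t77643\t                      Mycobacterium tuberculosis complex",
    "  6.76\t218100\t200172\tS\t1773\t                        Mycobacterium tuberculosis",
    "  0.44\t14259\t14259\t-\t1334058\t                         Mycobacterium tuberculosis TRS12",
]

# The genus row is skipped; the complex, species, and strain rows all match.
print(round(substring_cov(sample), 2))  # 23.03
```

For the rows shown, the "complex" row and its descendants all match, so nested clade percentages are added together, yet the total still falls far below the threshold of 90. That is consistent with the database, rather than the match string, being the limiting factor.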

mezewudo commented 5 years ago

Yes, it has to be run on the bacteria database. I will update the documentation in the next major revision of the software.

On Wed, Nov 21, 2018 at 3:04 PM pvishwa2 notifications@github.com wrote:

Actually, wouldn't the find("Mycobacterium tuberculosis") match the "Mycobacterium tuberculosis complex" in addition to all subsequent "Mycobacterium tuberculosis" containing lines? The issue (I think) may lie in the Kraken database used. I've compared Kraken results from the Galaxy server (using the bacteria database) and from a local machine (using the standard database) that show the same kind of result that matnguyen got. Maybe you could add it somewhere in the dependencies that UVP needs to be run using the "bacteria" kraken database specifically?
