compbiocore / qcdb

QC Database

Parser for picardtools does not parse paired-end data correctly #30

Closed aguang closed 2 years ago

aguang commented 4 years ago

If alignments for picardTools are from paired-end data, they currently get parsed as:

| 237 | SRS4807081_SRX5884551 | picard     | alignmentmetrics | "{\"c\": \"PAIR\", \"t\": \"83931727\", \"p\": \"83931727\", \"ppr\": \"1\", \"pnr\": \"0\", \"pra\": \"83931727\", \"ppra\": \"1\", \"pab\": \"3804319298\", \"phar\": \"75090141\", \"phab\": \"3322657306\", \"phaqb\": \"3265450307\", \"phmm\": \"0\", \"pmr\": \"0.640134\", \"pher\": \"0.671578\", \"pir\": \"0.000278\", \"m\": \"100.09776\", \"r\": \"83358818\", \"praip\": \"0.993174\", \"prip\": \"744563\", \"pprip\": \"0.008871\", \"b\": \"0\", \"s\": \"0.501104\", \"pc\": \"0.002787\", \"pa\": \"0\", \"s1\": \"\", \"l\": \"\", \"rg\": \"\"}" |

This is the wrong behavior: those files also contain FIRST_OF_PAIR and SECOND_OF_PAIR rows, which should have been added to the JSON as well. It happens because the picardTools parser uses self.library_read_type to determine whether reads are single end or paired end, while baseParser uses _1_metric, _2_metric, or _metric to determine whether samples are paired or single end. Since bioflows picardTools output is always of the form _metric, everything gets parsed as single-end data. Also, self.library_read_type will be going away per #10.

The fix should just parse all the rows without trying to decide whether the data is paired or single end, and add a column for read_id that has blank values.
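A rough sketch of what "parse all the rows" could look like, assuming the standard Picard metrics text layout (a `## METRICS CLASS` line, then a tab-separated header row, then data rows, ended by a blank line). The names `parse_metrics_rows` and `to_records` and the record layout are hypothetical and are not qcdb's actual code; the example values are made up for illustration, and the sketch keeps Picard's full column names rather than the abbreviated JSON keys shown above.

```python
import json


def parse_metrics_rows(text):
    """Return one dict per data row of a Picard-style metrics table.

    No attempt is made to decide whether the data is paired or single end:
    every row (PAIR, FIRST_OF_PAIR, SECOND_OF_PAIR, UNPAIRED, ...) is kept.
    """
    header = None
    rows = []
    in_metrics = False
    for line in text.splitlines():
        if line.startswith("## METRICS CLASS"):
            in_metrics = True
            continue
        if not in_metrics:
            continue
        if not line.strip():
            if header is not None:
                break  # a blank line ends the metrics table
            continue
        if line.startswith("#"):
            continue
        fields = line.split("\t")
        if header is None:
            header = fields
            continue
        # Trailing empty columns may be missing, so pad short rows.
        fields += [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return rows


def to_records(sample_id, rows):
    """Build one record per row with a blank read_id (hypothetical layout)."""
    return [
        {"sample_id": sample_id, "read_id": "", "data": json.dumps(row)}
        for row in rows
    ]


# Made-up miniature input; real files have many more columns.
PAIRED_EXAMPLE = "\n".join([
    "## htsjdk.samtools.metrics.StringHeader",
    "## METRICS CLASS\tpicard.analysis.AlignmentSummaryMetrics",
    "CATEGORY\tTOTAL_READS\tPF_READS",
    "FIRST_OF_PAIR\t10\t10",
    "SECOND_OF_PAIR\t10\t10",
    "PAIR\t20\t20",
    "",
])

if __name__ == "__main__":
    for record in to_records("example_sample", parse_metrics_rows(PAIRED_EXAMPLE)):
        print(record)
```

With this approach a single-end file simply yields one row and a paired-end file yields three, so the parser no longer depends on self.library_read_type or on the _metric suffix.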

aguang commented 2 years ago

Fixed with pull request #34