PacificBiosciences / apps-scripts

Miscellaneous scripts for applications of PacBio systems
Other
25 stars 11 forks source link

error in read filtering #6

Closed ke-shi closed 6 years ago

ke-shi commented 6 years ago

We tried to filter reads from a bam file using the following methods.

  1. Made a list file containing subread names [movie]/[holenumber]/[qStart]_[qEnd] like:

A file "textFile1.list" listing

m54058R2_170526_090336/10027128/716_27569 ...

  1. Performed the following command

dataset filter in.dataset.xml filtered.dataset.xml 'qname=textFile1.list'

The resultant filtered.dataset.xml contained:

    <pbds:Filters>
            <pbds:Filter>
                    <pbbase:Properties>
                            <pbbase:Property Name="qname" Operator="=" Value="textFile1.list"/>
                    </pbbase:Properties>
            </pbds:Filter>
    </pbds:Filters>
  1. Then,

dataset consolidate filtered.dataset.xml consolidate.subreads.bam out_fn.subreadset.xml as described in the pages 21 to 22 of SMRT Tools Reference Guide.

However, we got the error:

Pbmerge command failed: pbmerge -o /tmp/tmp9MpKvHconsolidate-xml/outfile.bam /tmp/tmp9MpKvHconsolidate-xml/orig.subreadset.xml Message: ERROR: error: could not create filter from XML Property element: Name: qname Value: textFile1.list Operator: = reason: PbiQueryNameFilter error: requested QNAME (textFile1.list) is not a valid PacBio BAM QNAME. See spec for details

We know "textFile1.list" is not subread names but the file name listing QNAMEs... Could you let me know what's wrong in our code?

jrharting commented 6 years ago

Hi ke-shi,

Thanks for pointing this out, it appears there is a bug here. The dataset tool recognizes the qname.list file, but the underlying code for parsing bams does not.

The workaround at the moment if you want to create a dataset that properly filters based on a file of qnames is to use the keyword "qname_file" in your filter command.

dataset filter in.dataset.xml filtered.dataset.xml 'qname_file=/path/to/qname.list'

This will generate some errors ("unrecognized filter") and the dataset summarize tool will not properly count filtered records, but underlying tools will properly use the qname file for whitelisting.

Incidently, you can accomplish these two steps (dataset filter && dataset consolidate) in a single step using the tool bamsieve which takes pacbio bam/xml as input and subsets the data according to white-/blacklists or randomly. The tool should be available in your smrtanalysis v5.1 command path, bamsieve --help

ke-shi commented 6 years ago

Thanks jrharting,

"qname_file" is working!! bamsieve is also tested. It is easy to use but subsets the data based on ZMW not subread names...