biosails / pheniqs

Fast and accurate sequence demultiplexing

Quickstart Example not working for me #34

Closed adrianomartinelli closed 2 years ago

adrianomartinelli commented 2 years ago

Hi! I would like to use your tool to demultiplex my files. However, when I try the Quickstart example, I cannot demultiplex the given files. I am working on a Mac with Python 3.6. In the attached folder I run: pheniqs mux --config pheniqs.json > report.txt

In the end I would simply like to demultiplex a single FASTQ file (my-example) with a sample barcode at position 1:5 of the read.

If you could point me to my mistake that would be highly appreciated. Best, Adriano

pheniqs_quickstart.zip my-example.zip

moonwatcher commented 2 years ago

Hi Adriano,

Thank you for considering Pheniqs for your project, and sorry for the late reply; I was on bonding leave with my firstborn.

Your config file had a few mistakes, but I think the main issue is that the sample decoder defaults to the passthrough algorithm, which simply emits the reads as they came in (it is used for quickly repackaging reads from one layout to another). If you specify "algorithm": "pamld" in the sample directive, you will get the expected output.

Your transform directive was also malformed: "transform": ["0:1:5"] should be "transform": { "token": [ "0:1:5" ] }. The validator did the right thing here and gave the following error message:

JSON directive validation error
Error description: Expected type object but actual is array
Path in document: /sample/transform
Document URL: testSample.json?format=json
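
The shape difference behind that error can be illustrated with a small Python sketch. It is not how Pheniqs validates its configuration; check_transform is a hypothetical helper written only to show that the directive must be a JSON object holding a "token" array rather than a bare array.

```python
import json

# JSON type names for the Python types json.loads can produce here.
JSON_TYPE_NAMES = {list: "array", dict: "object", str: "string"}


def check_transform(directive):
    """Return a list of problems with a transform directive.

    Hypothetical helper for illustration only; not part of Pheniqs.
    """
    if not isinstance(directive, dict):
        actual = JSON_TYPE_NAMES.get(type(directive), type(directive).__name__)
        return [f"expected type object but actual is {actual}"]
    token = directive.get("token")
    if not isinstance(token, list) or not all(isinstance(t, str) for t in token):
        return ['"token" must be an array of strings']
    return []


# The malformed directive from the original config and the corrected one.
malformed = json.loads('{"sample": {"transform": ["0:1:5"]}}')
corrected = json.loads('{"sample": {"transform": {"token": ["0:1:5"]}}}')

malformed_problems = check_transform(malformed["sample"]["transform"])
corrected_problems = check_transform(corrected["sample"]["transform"])

print(malformed_problems)
print(corrected_problems)
```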

You also did not specify an output template directive, so the entire read will be emitted in the output reads. You probably wanted to trim the decoded barcode from the output, in which case you should add something like "template": { "transform": { "token": [ "0:1:5" ] } }.
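
How a token such as 0:1:5 selects cycles can be illustrated with a short Python sketch. It assumes the pattern reads segment:start:end with zero-based, end-exclusive coordinates, like a Python slice, which is consistent with the length of 4 that --validate reports for 0:1:5 in this thread. The helper apply_token and the read sequence are made up for illustration and are not part of Pheniqs.

```python
def apply_token(segments, pattern):
    """Extract the cycles named by a 'segment:start:end' pattern.

    Assumption: coordinates behave like a Python slice (zero-based start,
    exclusive end, either one may be left empty). Hypothetical helper.
    """
    segment_index, start, end = pattern.split(":")
    read = segments[int(segment_index)]
    start = int(start) if start else None
    end = int(end) if end else None
    return read[start:end]


# A made-up single-segment read: one leading base, a 4-base barcode, then the insert.
segments = ["GAAAATTTGGCC"]

barcode = apply_token(segments, "0:1:5")    # the four cycles at positions 1..4
remainder = apply_token(segments, "0:5:")   # everything after the barcode

print(barcode)
print(remainder)
```

Under the same assumption, a template token with an open end, such as 0:5:, would keep only the cycles that follow the barcode, whereas 0:1:5 keeps the barcode cycles themselves.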

Notice that if you execute with --validate, Pheniqs will tell you exactly what it plans to do; this often helps when debugging configuration "misunderstandings". For instance, after correcting your config to

{
    "input": [
        "test.fastq"
    ],
    "sample": {
        "transform": { "token": [ "0:1:5" ] },
        "codec": {
            "@FirstCode": { "barcode": [ "AAAA" ] },
            "@SecondCode": { "barcode": [ "CCCC" ] }
        }
    },
    "template": { "transform": { "token": [ "0:1:5" ] } }
}

executing pheniqs mux --config testSample.json --validate returns this report, where you can see that the sample decoding algorithm is still passthrough:

Environment

    Base input URL                              /Users/lg/Desktop/ticket 34/my-example
    Base output URL                             /Users/lg/Desktop/ticket 34/my-example
    Platform                                    ILLUMINA
    Quality tracking                            disabled
    Filter incoming QC failed reads             disabled
    Filter outgoing QC failed reads             disabled
    Input Phred offset                          33
    Output Phred offset                         33
    Leading segment index                       0
    Default output format                       sam
    Default output compression                  unknown
    Default output compression level            5
    Feed buffer capacity                        2048
    Threads                                     8
    Decoding threads                            1
    HTSLib threads                              8

Input

    Input segment cardinality                   1

    Input segment No.0 : /Users/lg/Desktop/ticket 34/my-example/test.fastq?format=fastq

    Input feed No.0
        Type : fastq
        Compression : unknown
        Resolution : 1
        Phred offset : 33
        Platform : ILLUMINA
        Buffer capacity : 2048
        URL : /Users/lg/Desktop/ticket 34/my-example/test.fastq?format=fastq

Output transform

    Output segment cardinality                  1

    Token No.0
        Length        4
        Pattern       0:1:5
        Description   cycles 1 to 5 of input segment 0

    Assembly instruction
        Append token 0 of input segment 0 to output segment 0

Sample decoding

    Decoding algorithm                          passthrough
    Shannon bound                               1
    Segment cardinality                         1
    Nucleotide cardinality                      4

    Transform

        Token No.0
            Length        4
            Pattern       0:1:5
            Description   cycles 1 to 5 of input segment 0

        Assembly instruction
            Append token 0 of input segment 0 to output segment 0

    Barcode undetermined
        ID : undetermined
        PU : undetermined
        Segment No.0  : /dev/stdout?format=sam&compression=none

    Barcode @FirstCode
        ID : AAAA
        PU : AAAA
        Concentration : 0.495
        Barcode       : AAAA
        Segment No.0  : /dev/stdout?format=sam&compression=none

    Barcode @SecondCode
        ID : CCCC
        PU : CCCC
        Concentration : 0.495
        Barcode       : CCCC
        Segment No.0  : /dev/stdout?format=sam&compression=none

    Output feed No.0
        Type : sam
        Resolution : 1
        Phred offset : 33
        Platform : ILLUMINA
        Buffer capacity : 2048
        URL : /dev/stdout?format=sam&compression=none

So the final config I suggest is

{
    "input": [
        "test.fastq"
    ],
    "sample": {
        "algorithm": "pamld",
        "transform": { "token": [ "0:1:5" ] },
        "codec": {
            "@FirstCode": { "barcode": [ "AAAA" ] },
            "@SecondCode": { "barcode": [ "CCCC" ] }
        }
    },
    "template": { "transform": { "token": [ "0:1:5" ] } }
}

which yields the expected output:

@HD VN:1.0  SO:unknown  GO:query
@RG ID:undetermined PU:undetermined
@RG ID:AAAA BC:AAAA PU:AAAA
@RG ID:CCCC BC:CCCC PU:CCCC
@PG ID:pheniqs  PN:pheniqs  CL:pheniqs mux --config testSample.json VN:2.1.0-37-g684f02b7b3bfaec7040337884b7f13ed6eb3fd58
IDENAIFIER  76  *   0   0   *   *   0   0   AAAA    CCCC    RG:Z:AAAA   BC:Z:AAAA   QT:Z:CCCC   XB:f:7.90337e-05
IDENAIFIER  76  *   0   0   *   *   0   0   CCCC    CCCC    RG:Z:CCCC   BC:Z:CCCC   QT:Z:CCCC   XB:f:7.90337e-05
{
    "incoming": {
        "count": 2,
        "pf count": 2,
        "pf fraction": 1.0
    },
    "outgoing": {
        "count": 2,
        "pf count": 2,
        "pf fraction": 1.0
    },
    "sample": {
        "average classified confidence": 0.999920966315065,
        "average pf classified confidence": 0.999920966315065,
        "classified": [
            {
                "BC": "AAAA",
                "ID": "AAAA",
                "PU": "AAAA",
                "average confidence": 0.999920966315065,
                "average pf confidence": 0.999920966315065,
                "barcode": [
                    "AAAA"
                ],
                "concentration": 0.495,
                "count": 1,
                "estimated concentration": 0.5,
                "index": 1,
                "pf count": 1,
                "pf fraction": 1.0,
                "pf pooled classified fraction": 0.5,
                "pf pooled fraction": 0.5,
                "pooled classified fraction": 0.5,
                "pooled fraction": 0.5
            },
            {
                "BC": "CCCC",
                "ID": "CCCC",
                "PU": "CCCC",
                "average confidence": 0.999920966315065,
                "average pf confidence": 0.999920966315065,
                "barcode": [
                    "CCCC"
                ],
                "concentration": 0.495,
                "count": 1,
                "estimated concentration": 0.5,
                "index": 2,
                "pf count": 1,
                "pf fraction": 1.0,
                "pf pooled classified fraction": 0.5,
                "pf pooled fraction": 0.5,
                "pooled classified fraction": 0.5,
                "pooled fraction": 0.5
            }
        ],
        "classified count": 2,
        "classified fraction": 1.0,
        "classified pf fraction": 1.0,
        "count": 2,
        "index": 0,
        "pf classified count": 2,
        "pf classified fraction": 1.0,
        "pf count": 2,
        "pf fraction": 1.0,
        "unclassified": {
            "ID": "undetermined",
            "PU": "undetermined",
            "count": 0,
            "index": 0,
            "pf count": 0,
            "pf fraction": 0.0,
            "pf pooled fraction": 0.0,
            "pooled fraction": 0.0
        }
    }
}
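
The JSON part of that output can be summarized with a few lines of Python. This is a minimal sketch that assumes the report has been captured separately from the SAM records; here an abbreviated copy of the report above is inlined so the example is self-contained, and only keys that appear in that report are read.

```python
import json

# Abbreviated copy of the report shown above (only the fields used below).
report_text = """
{
    "sample": {
        "classified": [
            {"ID": "AAAA", "count": 1, "average confidence": 0.999920966315065, "pooled fraction": 0.5},
            {"ID": "CCCC", "count": 1, "average confidence": 0.999920966315065, "pooled fraction": 0.5}
        ],
        "classified count": 2,
        "count": 2,
        "unclassified": {"ID": "undetermined", "count": 0}
    }
}
"""

sample = json.loads(report_text)["sample"]

# One row per barcode: read group ID, read count, fraction of the pool.
rows = [(entry["ID"], entry["count"], entry["pooled fraction"])
        for entry in sample["classified"]]
unclassified = sample["unclassified"]["count"]
classified_fraction = sample["classified count"] / sample["count"]

for read_group, count, fraction in rows:
    print(f"{read_group}\t{count}\t{fraction}")
print(f"undetermined\t{unclassified}")
print(f"classified fraction\t{classified_fraction}")
```

With the passthrough decoder no barcodes are classified, so a quick check like this makes the difference from pamld easy to spot.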

Please let us know if there are any other issues you are experiencing.

Regards,

L.