Updating the config file for hybrid assembly support

danejo3 commented 10 months ago

The config file that users create for YEAT has worked well as a way to describe multiple samples and assemblies in a single set of instructions. Its main drawback is that it is easy to get wrong: a user may forget a key, a comma, or a bracket, or copy-paste the wrong data, and these mistakes can be cumbersome to fix.
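To illustrate how a missing comma or bracket could be caught early, here is a minimal Python sketch using only the standard json module. It is not YEAT's actual code, and the helper name check_config_syntax is hypothetical.

```python
import json


def check_config_syntax(text):
    """Return None if text is valid JSON, else a short error description."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at the position where parsing failed.
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


# A missing comma between two keys, a typical copy-paste slip.
broken = '{\n    "label": "unicycler-default"\n    "algorithm": "Unicycler"\n}'
print(check_config_syntax(broken))
```

Reporting the line and column lets the user jump to the slip instead of scanning the whole file.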

Here is a configuration file without hybrid assembly. With this file, Unicycler runs twice: once on the paired-end reads only (sample1) and once on the pacbio-hifi reads only (sample2).

{
    "samples": {
        "sample1": {
            "paired": [
                "yeat/tests/data/short_reads_1.fastq.gz",
                "yeat/tests/data/short_reads_2.fastq.gz"
            ]
        },
        "sample2": {
            "pacbio-hifi": [
                "yeat/tests/data/long_reads.fastq.gz"
            ]
        }
    },
    "assemblers": [
        {
            "label": "unicycler-default",
            "algorithm": "Unicycler",
            "extra_args": "",
            "samples": [
                "sample1",
                "sample2"
            ]
        }
    ]
}

To enable hybrid assembly with the current configuration file, there are a few ways to go about it.

One way is to combine samples in a nested list inside the assembler's "samples" list.

"assemblers": [
    {
        "label": "unicycler-default",
        "algorithm": "Unicycler",
        "extra_args": "",
        "samples": [
            ["sample1", "sample2"]
        ]
    }
]

Another way is to create a new sample and add the two different kinds of reads together.

"sample3": {
    "paired": [
        "yeat/tests/data/short_reads_1.fastq.gz",
        "yeat/tests/data/short_reads_2.fastq.gz"
    ],
    "pacbio-hifi": [
        "yeat/tests/data/long_reads.fastq.gz"
    ]
}

The problem with the first suggestion is that when samples are combined in a list (["sample1", "sample2"]), the user cannot specify the sample or analysis label of the hybrid assembly. YEAT would have to create an arbitrary label for the combined samples, for example hybrid1.

sandbox/
├─ analysis/
│  ├─ hybrid1/
├─ seq/
│  ├─ input/
│  │  ├─ hybrid1/

To resolve this issue, we could do option 2; however, this introduces duplicate data in the config file.

When I designed the config file, I wanted to prevent users from copy-pasting duplicate data in the config. By doing option 1, we can keep this goal; however, we lose the ability to label.

After spending some time thinking about the design of the config file, here is what I've come up with:

{
   "reads":{
      "paired-reads-1":{
         "paired":[
            "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_1.fastq.gz",
            "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_2.fastq.gz"
         ]
      },
      "long-reads-1":{
         "pacbio-corr":[
            "/Users/dane.jo/Desktop/test1/data/hybrid/long_reads_low_depth.fastq.gz"
         ]
      }
   },
   "samples": {
      "sample1": ["paired-reads-1"],
      "sample2": ["long-reads-1"],
      "sample3": ["paired-reads-1", "long-reads-1"]
   },
   "assemblers":[
      {
         "label":"unicycler-default",
         "algorithm":"unicycler",
         "extra_args":"",
         "samples": [
            "sample1",
            "sample2",
            "sample3"
         ]
      }
   ]
}

Notice that hybrid assembly is enabled by combining the two ideas: 1) give the sample a label and 2) put the two types of reads in a list.

"samples": {
  "sample3": ["paired-reads-1", "long-reads-1"]
}

In the new configuration file, we break up the nested dictionary into three parts: 1) reads, 2) samples, and 3) assemblers.
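A minimal Python sketch of how the three-part layout might be resolved is shown here. It assumes each entry in "samples" is a list of labels defined under "reads". The function name resolve_samples is hypothetical, and the shortened paths are placeholders.

```python
def resolve_samples(config):
    """Map each sample label to the read entries it references."""
    reads = config["reads"]
    resolved = {}
    for sample, read_labels in config["samples"].items():
        # A sample may only reference labels defined in the "reads" section.
        missing = [label for label in read_labels if label not in reads]
        if missing:
            raise KeyError(f"{sample} references undefined reads: {missing}")
        resolved[sample] = {label: reads[label] for label in read_labels}
    return resolved


config = {
    "reads": {
        "paired-reads-1": {"paired": ["short_reads_1.fastq.gz", "short_reads_2.fastq.gz"]},
        "long-reads-1": {"pacbio-corr": ["long_reads_low_depth.fastq.gz"]},
    },
    "samples": {
        "sample1": ["paired-reads-1"],
        "sample3": ["paired-reads-1", "long-reads-1"],
    },
}
resolved = resolve_samples(config)
print(sorted(resolved["sample3"]))
```

The extra level of indirection is what removes duplicate paths, but it also means a mistyped read label has to be detected explicitly, as the KeyError above does.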

What influenced my decision to go with this design?

1) The relationships are one-to-many at each level. A sample can have many reads, and an assembly algorithm can run many samples.
2) A request was made to add more options for combining reads. Currently, YEAT can process multiple long-read files together when the sample has multiple paths in its list.

 "long-reads-1":{
   "pacbio-corr":[
      "/Users/dane.jo/Desktop/test1/data/hybrid/long_reads_low_depth.fastq.gz",
      "another/path/here.gz",
      "here/is/another/path/here.gz"
   ]
}

However, this does not work for paired-end reads because YEAT cannot accept more than two paired-end read files. To fix this, I would like to change the paired-end entry to the following:

"paired-reads-1":{
   "paired": {
      "R1": [
         "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_1.fastq.gz"
      ],
      "R2": [
         "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_2.fastq.gz"
      ]
   }
}

If the user would like to add additional R1 and R2 reads, they can just append the paths to the list.

"paired-reads-1":{
   "paired": {
      "R1": [
         "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_1.fastq.gz",
         "another_read.gz",
         "another_another_read.gz"
      ],
      "R2": [
         "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_2.fastq.gz",
         "another_read.gz",
         "another_another_read.gz"
      ]
   }
}
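One way the R1/R2 lists above might be consumed is sketched here in Python. It assumes that files are matched by position (the first R1 path with the first R2 path, and so on), which the proposal does not state explicitly. The function name pair_reads is hypothetical.

```python
def pair_reads(paired):
    """Pair the i-th R1 path with the i-th R2 path."""
    r1, r2 = paired["R1"], paired["R2"]
    # Assumption: every R1 file needs exactly one mate in R2.
    if len(r1) != len(r2):
        raise ValueError(f"R1 has {len(r1)} files but R2 has {len(r2)}")
    return list(zip(r1, r2))


paired = {
    "R1": ["short_reads_1.fastq.gz", "another_read_1.gz"],
    "R2": ["short_reads_2.fastq.gz", "another_read_2.gz"],
}
pairs = pair_reads(paired)
print(pairs)
```

A length check like this would catch the case where a path is appended to one list but not the other.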

In addition to this support, a way to combine reads with different labels was requested.

In the samples section of the configuration, users can combine reads with different labels by putting them into a list.

{
   "reads":{
      "paired-reads-1":{
         "paired": {
            "R1": [
               "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_1.fastq.gz"
            ],
            "R2": [
               "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_2.fastq.gz"
            ]
         }
      },
      "paired-reads-1.5":{
         "paired": {
            "R1": [
               "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_1.fastq.gz"
            ],
            "R2": [
               "/Users/dane.jo/Desktop/test1/data/hybrid/short_reads_2.fastq.gz"
            ]
         }
      },
      "long-reads-1":{
         "pacbio-corr":[
            "/Users/dane.jo/Desktop/test1/data/hybrid/long_reads_low_depth.fastq.gz"
         ]
      },
      "long-reads-2":{
         "pacbio-corr":[
            "/Users/dane.jo/Desktop/test1/data/hybrid/long_reads_high_depth.fastq.gz"
         ]
      }
   },
   "samples": {
      "sample1": ["paired-reads-1", "paired-reads-1.5"],
      "sample2": ["long-reads-1", "long-reads-2"]
   },
   "assemblers":[
      {
         "label":"unicycler-default",
         "algorithm":"unicycler",
         "extra_args":"",
         "samples": [
            "sample1",
            "sample2"
         ]
      }
   ]
}

This new design can resolve all of the design requests; however, it introduces a problem: too much labeling. Users will need to label both the reads and the samples. I'm not sure whether this will be a serious problem, but the extra labels add overhead when filling out the configuration file. In addition, implementing all of these suggestions can make the configuration file cluttered and unpleasant to read or edit.

danejo3 commented 9 months ago

UPDATE:

New proposed config file.

Instead of three top-level sections (reads, samples, assemblers), we'll drop back down to two.
Samples can have more than one read type.
Instead of R1 and R2 for paired sample reads, we can use tuples to reduce the nesting.
- For single reads, still use a tuple (for example, ("single-end.fq",)).
Users can add multiple sample reads together in their respective lists (for example, "paired": [("r1", "r2"), ("r1.1", "r2.2")]).
Instead of "assemblers", change to "assemblies".
- This makes the first-level dictionaries "samples" and "assemblies".
- For each entry in the old "assemblers" list, make the label the key.
Users can put the keyword "mode" in each assembly-algorithm dictionary.
- Options will be "Illumina", "long", or "hybrid".
- Setting "mode": "hybrid" tells the assembler to run in hybrid mode.
- If the user does not add "mode" to the dictionary, YEAT will go with whatever is found in the sample:
  - only Illumina reads -> short read assemblers
  - only long reads -> long read assemblers
  - both paired-end and long reads -> hybrid assemblers
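The fallback rule for "mode" could be sketched in Python as follows. This is an illustration, not YEAT's implementation: the function name infer_mode is hypothetical, and the read-type sets are assumptions ("single" is a guessed key, the long-read set only contains the two types named in this thread, and treating single-end plus long reads as hybrid is also an assumption).

```python
ALLOWED_MODES = {"Illumina", "long", "hybrid"}
SHORT_READ_TYPES = {"paired", "single"}            # "single" is an assumed key
LONG_READ_TYPES = {"pacbio-hifi", "pacbio-corr"}   # assumed to be incomplete


def infer_mode(sample, requested=None):
    """Return the explicit mode if given, else infer it from the read types."""
    if requested is not None:
        if requested not in ALLOWED_MODES:
            raise ValueError(f"unknown mode: {requested!r}")
        return requested
    read_types = set(sample)
    has_short = bool(read_types & SHORT_READ_TYPES)
    has_long = bool(read_types & LONG_READ_TYPES)
    if has_short and has_long:
        return "hybrid"
    if has_long:
        return "long"
    if has_short:
        return "Illumina"
    raise ValueError("sample has no recognized read types")


sample1 = {
    "paired": [["short_reads_1.fastq.gz", "short_reads_2.fastq.gz"]],
    "pacbio-hifi": ["long_reads.fastq.gz"],
}
print(infer_mode(sample1))  # hybrid
```

Letting an explicit "mode" override the inferred value keeps the common case short while still allowing, for example, a short-read-only run on a sample that also carries long reads.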

Example config file:

{
    "samples": {
        "sample1": {
            "paired": [
                ("yeat/tests/data/short_reads_1.fastq.gz", "yeat/tests/data/short_reads_2.fastq.gz")
            ],
            "pacbio-hifi": [
                ("yeat/tests/data/long_reads.fastq.gz",)
            ]
        }
    },
    "assemblies": {
        "unicycler-default": {
            "algorithm": "Unicycler",
            "extra_args": "",
            "samples": [
                "sample1"
            ],
            "mode": "hybrid"
        }
    }
}

danejo3 commented 8 months ago

Just realized that tuples cannot be used in JSON. JSON has no tuple syntax, and when Python dumps a tuple into a JSON string, it converts it into an array! So the first snippet below has to be written as the second.

 "paired": [
     ("yeat/tests/data/short_reads_1.fastq.gz", "yeat/tests/data/short_reads_2.fastq.gz")
 ]

"paired": [
    [
        "yeat/tests/data/short_reads_1.fastq.gz",
        "yeat/tests/data/short_reads_2.fastq.gz"
    ]
]
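The behavior can be reproduced with Python's standard json module. The short sketch below shows a tuple being written out as a JSON array, coming back as a list, and tuple syntax being rejected when it appears in the JSON text itself.

```python
import json

config = {"paired": [("short_reads_1.fastq.gz", "short_reads_2.fastq.gz")]}

# Dumping converts the tuple into a JSON array.
text = json.dumps(config)
print(text)  # {"paired": [["short_reads_1.fastq.gz", "short_reads_2.fastq.gz"]]}

# Loading the text back gives a list, not a tuple.
restored = json.loads(text)
print(type(restored["paired"][0]).__name__)  # list

# Parentheses are not JSON syntax, so a hand-written tuple fails to parse.
tuple_rejected = False
try:
    json.loads('{"paired": [("r1", "r2")]}')
except json.JSONDecodeError:
    tuple_rejected = True
print(tuple_rejected)  # True
```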

danejo3 commented 8 months ago

UPDATE: To avoid unnecessary repetition and tedious copy-pasting, we will use a list of lists only for paired-end reads and a flat list for everything else.

{
    "samples": {
        "sample1": {
            "paired": [
                ["yeat/tests/data/short_reads_1.fastq.gz", "yeat/tests/data/short_reads_2.fastq.gz"]
            ],
            "pacbio-hifi": [
                "yeat/tests/data/long_reads.fastq.gz"
            ]
        }
    },
    "assemblies": {
        "unicycler-default": {
            "algorithm": "Unicycler",
            "extra_args": "",
            "samples": [
                "sample1"
            ],
            "mode": "hybrid"
        }
    }
}
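A check for this final layout might look like the Python sketch below. It assumes "paired" is the only read type that uses two-element lists and that every other read type holds plain path strings. The function name validate_sample is hypothetical and is not taken from YEAT.

```python
def validate_sample(name, sample):
    """Return a list of problems found in one sample's read entries."""
    problems = []
    for read_type, entries in sample.items():
        for entry in entries:
            if read_type == "paired":
                # Paired-end entries must be [R1, R2] lists of two paths.
                is_pair = (
                    isinstance(entry, list)
                    and len(entry) == 2
                    and all(isinstance(path, str) for path in entry)
                )
                if not is_pair:
                    problems.append(f"{name}: 'paired' entry is not a pair: {entry!r}")
            elif not isinstance(entry, str):
                # Every other read type is a flat list of path strings.
                problems.append(f"{name}: '{read_type}' entry is not a path: {entry!r}")
    return problems


sample1 = {
    "paired": [["short_reads_1.fastq.gz", "short_reads_2.fastq.gz"]],
    "pacbio-hifi": ["long_reads.fastq.gz"],
}
print(validate_sample("sample1", sample1))  # []
```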

danejo3 commented 1 week ago

Hybrid support was added in #68.

bioforensics / yeat

Updating the config file for hybrid assembly support #56