New config file format; enables multiple input samples and running multiple assembly algorithms (both paired and pacbio!)

danejo3 commented 1 year ago

This PR makes a major overhaul of the CLI because of the proposed changes to the required config file. In the new config file, users specify samples with file paths, and assembly algorithms with labels, extra_args, and the names of the samples to run them on.

For example:

{
    "samples": {
        "sample1": [
            "/Users/dane.jo/Desktop/test1/data/short1.fq.gz",
            "/Users/dane.jo/Desktop/test1/data/short2.fq.gz"
        ],
        "sample2": [
            "/Users/dane.jo/Desktop/test1/data/short1.fq.gz",
            "/Users/dane.jo/Desktop/test1/data/short2.fq.gz"
        ],
        "sample3": [
            "/Users/dane.jo/Desktop/test1/data/ecoli.fastq.gz"
        ]
    },
    "assemblers": [
        {
            "label": "spades-meta",
            "algorithm": "spades",
            "extra_args": "--meta",
            "samples": [
                "sample1"
            ]
        },
        {
            "label": "megahit-mins",
            "algorithm": "megahit",
            "extra_args": "--min-count 5 --min-contig-len 300",
            "samples": [
                "sample1",
                "sample2"
            ]
        },
        {
            "label": "hicanu",
            "algorithm": "canu",
            "extra_args": "genomeSize=4.6m",
            "samples": [
                "sample3"
            ]
        },
        {
            "label": "hiflye",
            "algorithm": "flye",
            "extra_args": "",
            "samples": [
                "sample3"
            ]
        }
    ]
}

By using this new config file, this PR will attempt to resolve #40 and #41.
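A config in this format could be sanity-checked before any assembler is launched. The snippet below is a minimal sketch, not YEAT's actual code: the function name undefined_sample_references is hypothetical, and the shortened config with placeholder paths is only for illustration. It reports every assembler entry that names a sample missing from the "samples" block.

```python
import json

# Shortened config in the proposed format; paths are placeholders.
CONFIG_TEXT = """
{
    "samples": {
        "sample1": ["path_to_read1", "path_to_read2"],
        "sample3": ["path_to_read"]
    },
    "assemblers": [
        {"label": "spades-meta", "algorithm": "spades",
         "extra_args": "--meta", "samples": ["sample1"]},
        {"label": "hicanu", "algorithm": "canu",
         "extra_args": "genomeSize=4.6m", "samples": ["sample3"]}
    ]
}
"""


def undefined_sample_references(config):
    """Return (label, sample) pairs where an assembler names an undefined sample."""
    defined = set(config["samples"])
    missing = []
    for assembler in config["assemblers"]:
        for sample in assembler["samples"]:
            if sample not in defined:
                missing.append((assembler["label"], sample))
    return missing


config = json.loads(CONFIG_TEXT)
print(undefined_sample_references(config))  # an empty list means every reference resolves
```

Returning the offending pairs, rather than raising on the first one, lets a user fix all typos in sample names in one pass.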

Because of the multi-input samples, the analysis directory has changed to organize all of the assembly results by sample names. Below is a screenshot of the new output directories.

Notice that Bandage and Quast are now in the final assembly directory. Also, because of the new config file format, users can now run both paired and long read assemblies!

danejo3 commented 1 year ago

@standage Whiteboard sesh

Before talking to Karla, I spent the past few days working out how multi-input sampling is done in one of our internal workflows and tried to re-implement it for YEAT in samples.py. You can find this in the first push of this PR.

The more I worked with it and tried to conform YEAT to it, the more I realized how difficult and restrictive it was to force the same strategy onto YEAT.

Yesterday, I took a step back and asked myself, "If YEAT were independent of the current way we handle multi-input sampling, how would I go about it?"

In our discussion, we came up with the solution described here, which can resolve #40 and #41.

One of the neat features that YEAT has is its ability to assemble with multiple unique assembly algorithms. In #40, a feature request was made for enabling multiple assemblies with the same algorithm. This is because, if the sample is unknown, it can be helpful to run the same assembler multiple times with different flags or commands.

For example:

spades
spades-meta
spades-krona
etc...

To enable this feature, each assembly algorithm in the config file would get an additional parameter called "label".

[
    {
        "label": "spades_normal",
        "algorithm": "spades",
        "extra_args": ""
    },
    {
        "label": "spades_meta",
        "algorithm": "spades",
        "extra_args": "--meta"
    },
    {
        "label": "spades_krona",
        "algorithm": "spades",
        "extra_args": "--krona"
    }
]
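Since the label is what tells apart several runs of the same algorithm, labels presumably need to be unique. The following is a small sketch of such a check, under that assumption; duplicate_labels is a hypothetical helper, not part of YEAT.

```python
from collections import Counter


def duplicate_labels(assemblers):
    """Return, in sorted order, every label used by more than one assembler entry."""
    counts = Counter(entry["label"] for entry in assemblers)
    return sorted(label for label, n in counts.items() if n > 1)


assemblers = [
    {"label": "spades_normal", "algorithm": "spades", "extra_args": ""},
    {"label": "spades_meta", "algorithm": "spades", "extra_args": "--meta"},
    # Reusing a label on purpose so the check has something to report.
    {"label": "spades_meta", "algorithm": "spades", "extra_args": ""},
]

print(duplicate_labels(assemblers))
```

Note that the same algorithm appearing several times is fine; only repeated labels are reported.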

Along with these changes, issue thread #41 talks about enabling support for multiple input samples. This feature would be extremely helpful if you want to run YEAT with multiple samples in one run. (We'll need to enable grid support to divvy up the job.)

For example,

sample1 has 2 fastq files
sample2 has 2 fastq files
sample3 has 1 fastq file

For sample1, we want to run the label spades-meta. For sample1 and sample2, we want to run the label megahit-mins. For sample3, we want to run the label pacbio-hifi.

The final proposed config file would look like this:

{
    "samples": {
        "sample1": [
            "path_to_read1",
            "path_to_read2"
        ],
        "sample2": [
            "path_to_read1",
            "path_to_read2"
        ],
        "sample3": [
            "path_to_read"
        ]
    },
    "assemblers": [
        {
            "label": "spades-meta",
            "algorithm": "spades",
            "extra_args": "--meta",
            "samples": [
                "sample1"
            ]
        },
        {
            "label": "megahit-mins",
            "algorithm": "megahit",
            "extra_args": "--min-count 5 --min-contig-len 300",
            "samples": [
                "sample1",
                "sample2"
            ]
        },
        {
            "label": "pacbio-hifi",
            "algorithm": "canu",
            "extra_args": "genomeSize=4.8m",
            "samples": [
                "sample3"
            ]
        }
    ]
}

The config is laid out this way because:

1) There are fewer opportunities for users to introduce input errors (I tried to keep users from having to copy and paste).
2) Samples can be passed in from anywhere by providing a path.
3) The CLI becomes much cleaner. For example, the basic YEAT command is now yeat -o sandbox -t 16 config.cfg, instead of yeat -o sandbox -t 16 config.cfg --pacbio read.fastq or yeat -o sandbox -t 16 config.cfg --paired read1.fastq read2.fastq.
4) YEAT can assemble both short and long reads in one run instead of two separate runs.
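To illustrate how small the command line becomes once the reads live in the config, here is a hedged argparse sketch. It is not YEAT's real CLI code. It assumes that -o names the output directory and -t the thread count, and the destination names outdir and threads are made up for this example.

```python
import argparse


def build_parser():
    """Parser for the simplified command: yeat -o sandbox -t 16 config.cfg"""
    parser = argparse.ArgumentParser(prog="yeat")
    # Assumption: -o is the output directory and -t is the number of threads.
    parser.add_argument("-o", dest="outdir", default=".", help="output directory")
    parser.add_argument("-t", dest="threads", type=int, default=1, help="threads")
    # The config file is the only positional argument; reads are no longer
    # passed with --paired or --pacbio.
    parser.add_argument("config", help="path to the config file")
    return parser


args = build_parser().parse_args(["-o", "sandbox", "-t", "16", "config.cfg"])
print(args.outdir, args.threads, args.config)
```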

Because of the config layout, the snakemake output directory structure will change as follows, using the example above.

- seq
  - (kept the same way YEAT does it now)
- analysis
  - sample1
     - spades-meta
     - megahit-mins
  - sample2
     - megahit-mins
  - sample3
     - pacbio-hifi
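The analysis layout follows directly from the config: one directory per (sample, label) pair. A minimal sketch of that expansion, assuming an analysis root directory as in the tree above (assembly_dirs is a hypothetical helper, not YEAT's Snakemake code):

```python
from pathlib import PurePosixPath


def assembly_dirs(config, root="analysis"):
    """List one output directory per (sample, label) pair, sorted by path."""
    dirs = []
    for assembler in config["assemblers"]:
        for sample in assembler["samples"]:
            dirs.append(str(PurePosixPath(root) / sample / assembler["label"]))
    return sorted(dirs)


# Only the keys needed for the layout are included here.
config = {
    "assemblers": [
        {"label": "spades-meta", "samples": ["sample1"]},
        {"label": "megahit-mins", "samples": ["sample1", "sample2"]},
        {"label": "pacbio-hifi", "samples": ["sample3"]},
    ]
}

for path in assembly_dirs(config):
    print(path)
```

In a Snakemake workflow, the same pairs would typically feed the target list of the top-level rule.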

standage commented 1 year ago

The proposed config file format looks good to me, as does the corresponding working directory format. Just one comment based on your example: you might consider making QUAST a subdirectory of each sample, rather than having a single QUAST directory containing every sample. That seems more consistent with the proposed layout.

Also, I think the term you're looking for is spades.py --corona, not spades.py --krona.

danejo3 commented 1 year ago

Running into a bug with MEGAHIT on my personal machine. Not sure what the problem is.

When running MEGAHIT with -t [2 or more cores] on my machine, it seg faults with error code 11.

Computer Specs:

In MEGAHIT's documentation:

Running MEGAHIT with -t 2 failed:

I've tried running MEGAHIT with other thread counts (3, 4, and 8) but get the same error. I've also run MEGAHIT by itself, outside the YEAT workflow, and still get the same error.

Interestingly, I ran MEGAHIT on the server, where we have plenty of cores, and it completed successfully with -t 128.

On my machine, MEGAHIT runs fine only with -t 1.

danejo3 commented 1 year ago

Still running into this issue when running MEGAHIT with a custom thread count on my MacBook Pro. Not sure why it's segfaulting.

Need to support grid ASAP. Enabling multiple assembly runs is too time consuming when running on a single machine.

This PR is ready for review! @standage

Let me know if there are any changes you would like me to pursue or add. I think this PR will be a huge leap in capability for YEAT once merged. Will probably cut a release after this or when grid support is enabled.

danejo3 commented 1 year ago

Thanks for the code review!

bioforensics / yeat

New config file format; enables multiple input samples and running multiple assembly algorithms (both paired and pacbio!) #43