Closed danejo3 closed 1 year ago
@standage Whiteboard sesh
Before talking to Karla, I spent the past few days understanding how multi-input sampling is done in one of our internal workflows and tried to re-implement it for YEAT in samples.py. You can find this in the first push of this PR.
The more I worked with it and tried to make YEAT conform to it, the more I realized how difficult and restrictive it was to force the same strategy onto YEAT.
Yesterday, I took a step back and asked myself, "If YEAT were independent of the current way we handle multi-input sampling, how would I go about it?"
In our discussion, we came up with this solution. This solution can resolve #40 and #41.
One of the neat features of YEAT is its ability to assemble with multiple assembly algorithms. In #40, a feature request was made for enabling multiple assemblies with the same algorithm. This is because, when the sample is unknown, it can be helpful to run the same assembler multiple times with different flags or commands.
To enable this feature, an additional parameter called "label" would be added to each assembly algorithm in the config file. For example:
[
    {
        "label": "spades_normal",
        "algorithm": "spades",
        "extra_args": ""
    },
    {
        "label": "spades_meta",
        "algorithm": "spades",
        "extra_args": "--meta"
    },
    {
        "label": "spades_krona",
        "algorithm": "spades",
        "extra_args": "--krona"
    }
]
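Since the same algorithm can now appear several times, the label is the only thing that tells the runs apart. A minimal sketch of a uniqueness check follows, assuming labels must be unique (the thread does not state this explicitly). The function name `find_duplicate_labels` is hypothetical and not part of YEAT.

```python
import json
from collections import Counter

# Hypothetical config fragment mirroring the example above.
CONFIG_TEXT = """
[
    {"label": "spades_normal", "algorithm": "spades", "extra_args": ""},
    {"label": "spades_meta", "algorithm": "spades", "extra_args": "--meta"}
]
"""


def find_duplicate_labels(assemblers):
    """Return the labels that appear more than once, sorted alphabetically."""
    counts = Counter(entry["label"] for entry in assemblers)
    return sorted(label for label, n in counts.items() if n > 1)


assemblers = json.loads(CONFIG_TEXT)
duplicates = find_duplicate_labels(assemblers)
if duplicates:
    raise ValueError(f"duplicate assembler labels: {duplicates}")
```

Rejecting duplicates up front would keep two runs from writing into the same labeled output directory.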
Along with these changes, issue thread #41 talks about enabling support for multiple input samples. This feature would be extremely helpful if you want to run YEAT with multiple samples in one run. (We'll need to enable grid support to divvy up the job.)
For example,
- sample1 has 2 fastq files
- sample2 has 2 fastq files
- sample3 has 1 fastq file

For sample1, we want to run the label spades-meta. For sample1 and sample2, we want to run the label megahit-mins. For sample3, we want to run the label pacbio-hifi.
In the final proposed config file, the following would be created.
{
    "samples": {
        "sample1": [
            "path_to_read1",
            "path_to_read2"
        ],
        "sample2": [
            "path_to_read1",
            "path_to_read2"
        ],
        "sample3": [
            "path_to_read"
        ]
    },
    "assemblers": [
        {
            "label": "spades-meta",
            "algorithm": "spades",
            "extra_args": "--meta",
            "samples": [
                "sample1"
            ]
        },
        {
            "label": "megahit-mins",
            "algorithm": "megahit",
            "extra_args": "--min-count 5 --min-contig-len 300",
            "samples": [
                "sample1",
                "sample2"
            ]
        },
        {
            "label": "pacbio-hifi",
            "algorithm": "canu",
            "extra_args": "genomeSize=4.8m",
            "samples": [
                "sample3"
            ]
        }
    ]
}
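To illustrate how this layout maps onto individual assembly runs, here is a rough sketch that expands the config into one job per (sample, label) pair and rejects assemblers that reference an undefined sample. The function name `expand_jobs` and the tuple layout are assumptions for illustration, not YEAT's actual implementation.

```python
import json

# The proposed config from above, with placeholder read paths.
CONFIG_TEXT = """
{
    "samples": {
        "sample1": ["path_to_read1", "path_to_read2"],
        "sample2": ["path_to_read1", "path_to_read2"],
        "sample3": ["path_to_read"]
    },
    "assemblers": [
        {"label": "spades-meta", "algorithm": "spades",
         "extra_args": "--meta", "samples": ["sample1"]},
        {"label": "megahit-mins", "algorithm": "megahit",
         "extra_args": "--min-count 5 --min-contig-len 300",
         "samples": ["sample1", "sample2"]},
        {"label": "pacbio-hifi", "algorithm": "canu",
         "extra_args": "genomeSize=4.8m", "samples": ["sample3"]}
    ]
}
"""


def expand_jobs(config):
    """Return one (sample, label, algorithm, reads) tuple per assembly run."""
    samples = config["samples"]
    jobs = []
    for assembler in config["assemblers"]:
        for name in assembler["samples"]:
            if name not in samples:
                raise ValueError(
                    f"assembler {assembler['label']!r} references "
                    f"unknown sample {name!r}"
                )
            jobs.append(
                (name, assembler["label"], assembler["algorithm"], samples[name])
            )
    return jobs


jobs = expand_jobs(json.loads(CONFIG_TEXT))
for sample, label, algorithm, reads in jobs:
    print(f"{sample}: {label} ({algorithm}) on {len(reads)} read file(s)")
```

Because reads are listed once under "samples" and only referenced by name under "assemblers", a path never has to be copied between assembler entries.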
The config is laid out in this proposed manner because:
1) there are fewer opportunities for users to introduce input errors (it avoids having users copy and paste)
2) samples can be passed in from anywhere by providing a PATH
3) the CLI becomes immensely cleaner (for example, the basic YEAT command becomes yeat -o sandbox -t 16 config.cfg instead of yeat -o sandbox -t 16 config.cfg --pacbio read.fastq and yeat -o sandbox -t 16 config.cfg --paired read1.fastq read2.fastq)
4) it also enables YEAT to assemble both short and long reads in one run instead of two separate runs
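As a rough sketch of what the simplified command line could look like, the parser below accepts only the output directory, thread count, and config path from the example command. The long option names `--outdir` and `--threads` and the defaults are assumptions for illustration; only `-o`, `-t`, and the positional config file appear in this thread.

```python
import argparse


def build_parser():
    """Build a parser for: yeat -o OUTDIR -t THREADS CONFIG."""
    parser = argparse.ArgumentParser(prog="yeat")
    # Long option names and defaults are hypothetical.
    parser.add_argument("-o", "--outdir", default=".", help="output directory")
    parser.add_argument("-t", "--threads", type=int, default=1, help="thread count")
    parser.add_argument("config", help="path to the config file")
    return parser


# Parse the example command from the list above.
args = build_parser().parse_args(["-o", "sandbox", "-t", "16", "config.cfg"])
print(args.outdir, args.threads, args.config)
```

With read paths living in the config, flags such as --paired and --pacbio are no longer needed on the command line.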
Because of the config layout, the following changes will be made to the Snakemake output directory structure, taking the example above.
- seq
    - -> blah (keep it the same way YEAT does it)
- analysis
    - sample1
        - spades-meta
        - megahit-mins
    - sample2
        - megahit-mins
    - sample3
        - pacbio-hifi
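The layout above follows an analysis/sample/label pattern, which can be sketched as a small path-building helper. The function name `assembly_dirs` is hypothetical, and the real Snakemake rules may build these paths differently.

```python
from pathlib import PurePosixPath

# Assembler entries trimmed to the fields that affect the directory layout.
ASSEMBLERS = [
    {"label": "spades-meta", "samples": ["sample1"]},
    {"label": "megahit-mins", "samples": ["sample1", "sample2"]},
    {"label": "pacbio-hifi", "samples": ["sample3"]},
]


def assembly_dirs(assemblers, root="analysis"):
    """Return the sorted output directories, one per (sample, label) pair."""
    return sorted(
        str(PurePosixPath(root) / sample / assembler["label"])
        for assembler in assemblers
        for sample in assembler["samples"]
    )


for directory in assembly_dirs(ASSEMBLERS):
    print(directory)
```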
The proposed config file format looks good to me, as does the corresponding working directory format. Just one comment based on your example: you might consider making QUAST a subdirectory of each sample, rather than having a single QUAST directory containing every sample. That seems more consistent with the proposed layout.
Also, I think the term you're looking for is spades.py --corona, not spades.py --krona.
Running into a bug with MEGAHIT on my personal machine. Not sure what the problem is.
When running MEGAHIT with -t [2 or more cores] on my machine, it segfaults with error code 11.
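For context, 11 is the signal number of SIGSEGV on Linux and macOS. If MEGAHIT were launched from Python with subprocess.run(), a crash from that signal would show up as a return code of -11. The helper below is a small sketch for turning such return codes into readable messages. The name `describe_returncode` is hypothetical, and how YEAT or Snakemake actually surfaces the failure is not shown in this thread.

```python
import signal


def describe_returncode(returncode):
    """Describe a subprocess return code in words (POSIX convention)."""
    if returncode == 0:
        return "success"
    if returncode < 0:
        # On POSIX, subprocess reports termination by signal N as -N.
        return f"killed by {signal.Signals(-returncode).name}"
    return f"exited with status {returncode}"


print(describe_returncode(-signal.SIGSEGV))
```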
Computer Specs:
In MEGAHIT's documentation:
Running MEGAHIT with -t 2 failed:
I've tried running MEGAHIT with different thread counts (3, 4, and 8), but I get the same error. I've also run MEGAHIT by itself, outside the YEAT workflow, and still get the same error.
Interestingly, I've run MEGAHIT on the server, where we have plenty of cores, and it ran successfully with -t 128.
MEGAHIT runs fine on my machine only with -t 1.
Still running into this issue when setting custom thread counts for MEGAHIT on my MacBook Pro. Not sure why it's segfaulting.
Need to support grid ASAP. Enabling multiple assembly runs is too time consuming when running on a single machine.
This MR is ready for review! @standage
Let me know if there are any changes you would like me to pursue or add. I think this PR will be a huge stepping stone for YEAT once merged. I will probably cut a release after this or when grid support is enabled.
Thanks for the code review!
In this PR, a major overhaul of the CLI was made because of the proposed changes to the required config file. In the new config file, users now provide samples with file paths, and assembly algorithms with labels, extra_args, and the names of the samples to run them on.
For example:
By using this new config file, this PR will attempt to resolve #40 and #41.
Because of the multi-input samples, the analysis directory has changed to organize all of the assembly results by sample names. Below is a screenshot of the new output directories.
Notice that Bandage and Quast are now in the final assembly directory. Also, because of the new config file format, users can now run both paired and long read assemblies!