BinPro / CONCOCT

Clustering cONtigs with COverage and ComposiTion

Comment about map-bowtie2-markduplicates.sh line 110 #228

Closed franciscozorrilla closed 5 years ago

franciscozorrilla commented 5 years ago

Hi,

I had a small comment about line 110 of the map-bowtie2-markduplicates.sh script:

if [ ! -e ${REF}.1.bt2 ]

I noticed that running this script resulted in the re-creation of the bowtie index, even though it had already been created in a previous step. It turns out that for "large" indexes, bowtie2 creates index files with the extension .bt2l instead of .bt2.

Perhaps this line of code could be amended to check whether one of the two file types is present before creating an index?

I am not sure exactly what the syntax would be, but something like this?

if [ ! -e ${REF}.1.bt2 ] && [ ! -e ${REF}.1.bt2l ]
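
For illustration, a minimal sketch of how this could look in place of the existing check (this assumes the index is built with bowtie2-build using the reference path as the index basename, as in the script):

if [ ! -e ${REF}.1.bt2 ] && [ ! -e ${REF}.1.bt2l ]; then
    # Neither a standard (.bt2) nor a large (.bt2l) index exists, so build one.
    bowtie2-build ${REF} ${REF}
fi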

Best wishes, Francisco

alneberg commented 5 years ago

Good catch! I'm actually thinking about moving away from the map-bowtie2-markduplicates.sh script, but if you still find it useful I should perhaps try to give it a brush-up.

In the new version 1.0.0, there is a script called concoct_coverage_table.py which generates the coverage table directly from bam files. The largest improvement with this script is that it can be used on bam files generated by mapping against the original contigs rather than the cut-up ones.

franciscozorrilla commented 5 years ago

Hi Johannes,

I do think it is quite a nifty script! But wouldn't you still need the map-bowtie2-markduplicates.sh script to generate the bam files to feed into concoct_coverage_table.py? Is this new script similar to gen_input_table.py?

Cheers, Francisco

alneberg commented 5 years ago

You would still need to run a mapping step, yes. There are several ways to do this, but I guess it makes sense to support at least one way for users. concoct_coverage_table.py is indeed similar to gen_input_table.py, but it takes sorted .bam files and a BED file as input instead of the coverage histogram files used by gen_input_table.py. I added the basic usage to the README, please let me know if you're not sure about how to use it.
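
For reference, the relevant part of the basic usage in the README looks roughly like this (file names are just placeholders; the bam files should be sorted and indexed):

# Cut up the original contigs and record the chunk coordinates in a BED file.
cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa

# Build the coverage table from the BED file and the sorted, indexed bam files.
concoct_coverage_table.py contigs_10K.bed mapping/Sample*.sorted.bam > coverage_table.tsv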

franciscozorrilla commented 5 years ago

I see! Thanks again for the info.

I had one more comment to make about this script, but it may not be relevant to you if you are moving away from its usage! Anyway, here it goes:

I ran my pipeline on the toy dataset from the old readthedocs complete example, and everything worked fine. Now I am testing my pipeline on a real dataset, as I mentioned in a previous issue. However, the pipeline is erroring out in my bowtie2 rule, which essentially just runs the map-bowtie2-markduplicates.sh script. It looks like MarkDuplicates is giving me some memory issues, which I am currently troubleshooting. Unfortunately, I essentially need to rerun the mapping every time I test a new set of MarkDuplicates parameters. This is a huge pain, since I need to queue my job on one of the few available nodes that have enough RAM, and then wait around 3 hours for the mapping to finish.

Because of this, I think it would be more practical for real-world usage to split this script into three (or at least two) separate scripts to allow for easier troubleshooting, i.e. first run the mapping for all samples, and then remove duplicates and calculate coverage statistics. I will probably do this for my own purposes, but it may be beneficial for others as well!
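
For example, a rough sketch of what the split could look like (sample names, thread counts and memory settings are placeholders, and the exact bowtie2/Picard options used by the original script may differ):

# Step 1: map reads against the reference and sort the alignments (run once per sample).
bowtie2 -p 8 -x ${REF} -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz | samtools sort -o sample.sorted.bam -
samtools index sample.sorted.bam

# Step 2: mark duplicates on the existing bam, so only this step has to be rerun
# while troubleshooting memory settings (adjust -Xmx as needed).
java -Xmx16g -jar picard.jar MarkDuplicates INPUT=sample.sorted.bam OUTPUT=sample.sorted.markdup.bam METRICS_FILE=sample.markdup_metrics.txt
samtools index sample.sorted.markdup.bam

# Step 3: compute coverage statistics from the deduplicated bams,
# e.g. with concoct_coverage_table.py in the new version.

That way steps 2 and 3 can be rerun without repeating the 3-hour mapping.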

Regards, Francisco

alneberg commented 5 years ago

Yes, this is exactly the problem with this script. It's useful for a toy example, but for a real-world case it is better to have a Snakemake pipeline or similar.