christophertbrown / iRep

scripts for estimating bacteria replication rates based on population genome copy number variation
MIT License

threads and large files #1

Open jbird9 opened 8 years ago

jbird9 commented 8 years ago

Hi Chris,

I am attempting to run iRep on a single metagenomic reconstruction with a very large SAM file (36 GB). The default thread behaviour is listed as 6 threads, but top shows only 1 CPU in use during the "parsing mapping files" phase.

Is the entire operation threaded or only parts?

Do you recommend subsampling such a large mapping file? Would the end results turn out similar?

Thanks,

Jordan

christophertbrown commented 8 years ago

Hi Jordan,

Only a few parts of the operation are multi-threaded. I chose not to multi-thread reading the SAM file because of I/O constraints. Unfortunately, this step takes the longest.

You could subset the SAM file, at least as an initial test to make sure the script is working with your data. It should not change the results as long as you still have sufficient coverage for the genomes of interest (a minimum of 5x is required for accurate iRep calculations).
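
If it helps, a quick test could look something like the following (file names are placeholders and the 10% subsampling fraction is arbitrary):

samtools view -h -s 0.1 mapping.sam > mapping.subset.sam    # keep the header, randomly keep ~10% of alignments
iRep -f genome.fna -s mapping.subset.sam -o test_subset     # run iRep on the subset as a sanity check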

However, it should not take prohibitively long to run through a 36 GB SAM file.

Chris


jbird9 commented 8 years ago

Thanks for the quick response Chris,

After poking around in samtools, it turns out the problem was a missing header (the @SQ lines), which was causing a silent failure.
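
For anyone who hits the same thing, something along these lines can check for and regenerate the header (paths are placeholders):

samtools view -H mapping.sam                                       # check whether @SQ header lines are present
samtools faidx genome.fna                                          # build genome.fna.fai with contig names and lengths
samtools view -h -t genome.fna.fai mapping.sam > mapping.hdr.sam   # rewrite the SAM with @SQ lines taken from the .fai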

Love the program.

Thanks,

Jordan

nr0cinu commented 7 years ago

Unfortunately, this step takes the longest.

If iRep supported indexed BAM files, wouldn’t that solve this problem, greatly improve performance, and reduce disk load?

Best, Bela

christophertbrown commented 7 years ago

Yes, it would! Adding support for BAM files is on my (long) to-do list! :-)


nigiord commented 6 years ago

Yes, it would! Adding support for BAM files is on my (long) to-do list! :-)

Is this still planned? I think it is a must-have if one wants to use iRep for large-scale analyses (lots of genomes across lots of samples).

christophertbrown commented 6 years ago

Hi Nils,

I’m really not sure when I will have time to provide proper support for BAM files.

However, you can provide mapping files to iRep by pipe, so it should be straightforward to write a simple script to parallelize analysis of BAM files using samtools.

For a single file, this would look something like the following.

$ samtools view sample_data/l_gasseri.fna-vs-l_gasseri_sample1-shrunk.sam | iRep -f sample_data/l_gasseri.fna -s - -o test
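
To run many BAM files at once, a simple wrapper along these lines should work (just a sketch, assuming GNU parallel is installed; file names are placeholders):

ls *.bam | parallel -j 4 'samtools view {} | iRep -f genome.fna -s - -o {.}_iRep'    # one iRep run per BAM, 4 at a time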

Sorry this is not simpler, but hope that it helps.

Best,

Chris


nigiord commented 6 years ago

Hi Chris,

However, you can provide mapping files to iRep by pipe, so it should be straightforward to write a simple script to parallelize analysis of BAM files using samtools.

I didn't know you could pipe mapping files that way in iRep. That is indeed very useful, thank you. For now I managed to improve the parsing step of iRep by splitting my SAM file before feeding it to iRep:

samtools sort mapping.bam -o mapping.sorted.bam                   # sort BAM file
samtools index mapping.sorted.bam                                 # index BAM file
contigs=$(awk '{if(gsub(/^>/,"")){printf " " $1}}' genome.fna)    # get contig names, space separated
samtools view -h mapping.sorted.bam $contigs |                    # get reads with reference in $contigs
  samtools sort -n -O sam |                                       # sort by read name for iRep parsing
  iRep -f genome.fna -s - -o mapping#genome                       # iRep

Since I have >8000 genomes in my initial mapping.bam file (to reduce cross-mapping), the parsing by iRep was initially taking forever. It is now much more efficient, since the input is reduced to the reads mapping to the genome of interest.
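
For reference, the same idea extended to a loop over many genome FASTAs (a rough sketch; the genomes/ directory and file names are placeholders):

for genome in genomes/*.fna; do
    name=$(basename "$genome" .fna)
    contigs=$(awk '{if(gsub(/^>/,"")){printf " " $1}}' "$genome")   # contig names for this genome
    samtools view -h mapping.sorted.bam $contigs |                  # $contigs unquoted so each contig is a separate region
      samtools sort -n -O sam |
      iRep -f "$genome" -s - -o "mapping#${name}"
done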

If you ever plan to implement BAM file support, you might want to look at pysam, since I think it already implements this (along with other useful features).

Best, Nils

christophertbrown commented 6 years ago

Hi Nils,

Thanks for sharing this workflow! Very nice solution for dealing with a large number of genomes and/or mappings. Indeed, the iRep parsing is very slow.

Thanks for pointing me to pysam, I’ll have a look and see if it can be incorporated.

Best,

Chris


stefcamp commented 6 years ago

Hi Chris, I would like to know if it is possible to list multiple genome names when you launch iRep, in order to scan the SAM file only once and perform the calculation for all the genomes listed. Thanks

christophertbrown commented 6 years ago

Sorry, but I’m not entirely sure that I understand the question.

If you provide a FASTA file with a subset of the genomes reported in the SAM file, then iRep values will only be calculated for those genomes. Does this help?


stefcamp commented 6 years ago

I am sorry, my question was probably not very clear, so let me reformulate. When I launch iRep, is it possible to specify multiple genomes on the command line? That is, instead of running iRep for "genome1.fna", then for "genome2.fna", etc.:

iRep -f genome1.fna -s samfile.sam -o tag1
iRep -f genome2.fna -s samfile.sam -o tag2
...

could I specify multiple genomes at once, e.g. iRep -f genome1.fna,genome2.fna,genome3.fna -s samfile.sam -o tag1,tag2,tag3, so that the SAM file is scanned only once for all the genomes? I am asking because I tried the procedure suggested by Nils, but it is not working and the coverage calculated by iRep is zero. Thanks

christophertbrown commented 6 years ago

Thanks for clarifying. That is a good question. As you point out, it definitely can save a lot of time to read through the SAM file only once.

You can provide additional genomes separated by spaces or with a wildcard, e.g. -f genome1.fna genome2.fna or -f *.fna.

Then, supply a single output file name, e.g. -o iRep_output, and all the results will be combined.
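
For example (with placeholder file names), either of these scans the SAM file once and combines the results:

iRep -f genome1.fna genome2.fna genome3.fna -s samfile.sam -o iRep_output
iRep -f *.fna -s samfile.sam -o iRep_output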

Hope this helps!

Chris


stefcamp commented 6 years ago

Thanks a lot Chris for your explanation; I had not understood this option from reading the manual. It is extremely useful! Best

Stefano

christophertbrown commented 6 years ago

Hi Stefano,

Thanks for pointing that out - I updated the README to try to make this clearer.

Chris
