Open jbird9 opened 8 years ago
Hi Jordan,
Only a few parts of the operation are multi-threaded. I chose not to multi-thread reading the sam file due to I/O. Unfortunately, this step takes the longest.
You could subset the sam file. At least as an initial test to make sure the script is working with your data. It should not change the results as long as you still have sufficient coverage for the genomes of interest (min. 5x required for accurate iRep calculations).
However, it should not take prohibitively long to run through a 36 G sam file.
Chris
On Jun 24, 2016, at 10:19 AM, jbird9 notifications@github.com wrote:
Hi Chris,
I am attempting to implement iRep for a single metagenomic reconstruction and a very large sam file (36 G). The default thread behaviour is listed a 6 threads, but top reveals I am only using 1 cpu during the parsing mapping files phase.
Is the entire operation threaded or only parts?
Do you recommend subsampling such a large mapping file? Should the end results turn out similar?
Thanks,
Jordan
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
Thanks for the quick response Chris,
Messing around in samtools it seems like it was an absence of header (@SQ...) problem that was throwing a silent error.
Love the program.
Thanks,
Jordan
Unfortunately, this step takes the longest.
If iRep supported indexed BAM files, wouldn’t this solve this problem and greatly improve performance and disk load?
Best, Bela
Yes, it would! Adding support for BAM files is on my (long) to-do list! :-)
On Dec 5, 2016, at 6:17 AM, Bela Hausmann notifications@github.com wrote:
Unfortunately, this step takes the longest.
If iRep supported indexed BAM files, wouldn’t this solve this problem and greatly improve performance and disk load?
Best, Bela
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Yes, it would! Adding support for BAM files is on my (long) to-do list! :-)
Still any plan for this? I think this is a must-have if one wants to use iRep for large scale analysis (lot of genomes in a lot of samples).
Hi Nils,
I’m really not sure when I will have time to provide proper support for BAM files.
However, you can provide mapping files to iRep by pipe, so it should be straightforward to write a simple script to parallelize analysis of BAM files using samtools.
For a single file, this would look something like the following.
$ samtools view sample_data/l_gasseri.fna-vs-l_gasseri_sample1-shrunk.sam | iRep -f sample_data/l_gasseri.fna -s - -o test
Sorry this is not simpler, but hope that it helps.
Best,
Chris
On Apr 24, 2018, at 5:14 AM, Nils Giordano notifications@github.com wrote:
Yes, it would! Adding support for BAM files is on my (long) to-do list! :-)
Still any plan for this? I think this is a must-have if one wants to use iRep for large scale analysis (lot of genomes in a lot of samples).
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Hi Chris,
However, you can provide mapping files to iRep by pipe, so it should be straightforward to write a simple script to parallelize analysis of BAM files using samtools.
I didn't know you could pipe mapping files that way in iRep. That is indeed very useful, thank you. For now I managed to improve the parsing step of iRep by splitting my SAM file before feeding it to iRep:
samtools sort mapping.bam -o mapping.sorted.bam # Sort bam file
samtools index mapping.sorted.bam # Index bam file
contigs=`awk '{if(gsub(/^>/,"")){printf " " $1}}' genome.fna`; # Get contig names space separated
samtools view mapping.sorted.bam $contigs -h \ # Get reads with reference in $contigs
| samtools sort -n -o mapping#genome.sam \ # Sort by RNAME for iRep parsing
| iRep -f genome.fna -s - -o mapping#genome # iRep
Since I have >8000 genomes in my initial mapping.bam file (to reduce cross-mapping), the parsing by iRep was initially taking forever. It is know much more efficient since the file is reduced to the reads mapping the genome of interest.
If you ever plan to implement BAM file support, you might want to look at pysam since I think they already implement it (along other useful features).
Best, Nils
Hi Nils,
Thanks for sharing this workflow! Very nice solution for dealing with a large number of genomes and/or mappings. Indeed, the iRep parsing is very slow.
Thanks for pointing me to pysam, I’ll have a look and see if it can be incorporated.
Best,
Chris
On May 2, 2018, at 7:32 AM, Nils Giordano notifications@github.com wrote:
Hi Chris,
However, you can provide mapping files to iRep by pipe, so it should be straightforward to write a simple script to parallelize analysis of BAM files using samtools.
I didn't know you could pipe mapping files that way in iRep. That is indeed very useful, thank you. For now I managed to improve the parsing step of iRep by splitting my SAM file before feeding it to iRep:
samtools sort mapping.bam -o mapping.sorted.bam # Sort bam file
samtools index mapping.sorted.bam
Index bam file
contigs=
awk '{if(gsub(/^>/,"")){printf " " $1}}' genome.fna
; # Get contig names space separatedsamtools view mapping.sorted.bam $contigs -h \ # Get reads with reference in $contigs | samtools sort -n -o mapping#genome.sam \ # Sort by RNAME for iRep parsing | iRep -f genome.fna -s - -o mapping#genome # iRep Since I have >8000 genomes in my initial mapping.bam file (to reduce cross-mapping), the parsing by iRep was initially taking forever.
If you ever plan to implement BAM file support, you might want to look at pysam since I think they already implement it (along other useful features).
Best, Nils
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Hi Chris, I would like to know if it is possible to list more genome names when you lunch iRep in order to scan the sam file once and to perform the calculation for all the genomes listed. Thanks
Sorry, but I’m not entirely sure that I understand the question.
If you provide a FASTA file with a subset of the genomes reported in the SAM file, then iRep values will only be calculated for those genomes. Does this help?
On Jul 19, 2018, at 7:56 AM, stefcamp notifications@github.com wrote:
Hi Chris, I would like to know if it is possible to list more genome names when you lunch iRep in order to scan the sam file once and to perform the calculation for all the genomes listed. Thanks
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
I am sorry, probably my question was not very clear. I reformulate. When I lunch iRep, is it possible to specify more genomes in the command line? So, instead of running iRep for "genome1.fna" and then for "genome2.fna", etc.
iRep -f genome1.fna -s samfile.sam -o tag1 iRep -f genome2.fna -s samfile.sam -o tag2 etc.
to specify more genomes such as iRep -f genome1.fna,genome2.fna,genome3.fna -s samfile.sam -o tag1,tag2,tag3 in order to scan the sam file only once for all the genomes? This is because, I tried the procedure suggested by Niels, but it is not working and the coverage calculated by iRep is zero. Thanks
Thanks for clarifying. That is a good question. As you point out, it definitely can save a lot of time to read through the SAM file only once.
You can provide additional genomes with a space or wild card, e.g. -f genome1.fna genome2.fna
or -f *.fna
.
Then, supply a single output file name, e.g. -o iRep_output
, and all the results will be combined.
Hope this helps!
Chris
On Jul 19, 2018, at 10:53 AM, stefcamp notifications@github.com wrote:
I am sorry, probably my question was not very clear. I reformulate. When I lunch iRep, is it possible to specify more genomes in the command line? So, instead of running iRep for "genome1.fna" and then for "genome2.fna", etc.
iRep -f genome1.fna -s samfile.sam -o tag1 iRep -f genome2.fna -s samfile.sam -o tag2 etc.
to specify more genomes such as iRep -f genome1.fna,genome2.fna,genome3.fna -s samfile.sam -o tag1,tag2,tag3 in order to scan the sam file only once for all the genomes? This is because, I tried the procedure suggested by Niels, but it is not working and the coverage calculated by iRep is zero. Thanks
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Stefano
Hi Stefano,
Thanks for pointing that out - I updated the README to try to make this more clear.
Chris
On Jul 19, 2018, at 11:27 AM, Stefano Campanaro notifications@github.com wrote:
Thanks a lot Chris for your explanation, I did not understand this option by reading the manual. It is extremely useful! Best
Stefano
— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.
Hi Chris,
I am attempting to implement iRep for a single metagenomic reconstruction and a very large sam file (36 G). The default thread behaviour is listed a 6 threads, but top reveals I am only using 1 cpu during the parsing mapping files phase.
Is the entire operation threaded or only parts?
Do you recommend subsampling such a large mapping file? Should the end results turn out similar?
Thanks,
Jordan