DennisSchmitz closed this issue 4 years ago.
@samnooij can you look at this bash script? It seems like values are computed, and stored in variables, but they aren't actually printed or used? Am I misunderstanding something?
Right, those variables don't do anything. They are more of a note to myself and the reader, showing how you can count different things with samtools. The actual 'work' is done by the `printf` line (which writes the header) and the final `samtools view [blablabla]` line. The comments in the script should be updated to reflect this.
Also, I didn't really consider the speed when developing this. It was primarily the functionality that I had to get working. I think it took about half a minute per sample when I tested, which was fine. I can imagine it's longer with more/bigger/more complex samples, which is undesirable and unnecessary. Indeed, parallelisation should be a good way to improve this. The easiest way to do that now would be to use bash parallelisation in the script (i.e. the `parallel` command, or forking processes to the background in the for-loop). Or would you prefer to have Snakemake handle everything (and probably make another rule to concatenate separate count tables)?
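For what it's worth, the same per-sample forking idea can be sketched in Python with `concurrent.futures` (the `count_sample` function and the file names are stand-ins for illustration, not the pipeline's real code):

```python
from concurrent.futures import ThreadPoolExecutor

def count_sample(bam):
    # Stand-in for the real per-sample work: in the pipeline this would
    # invoke samtools on one BAM file and return its counts.
    return (bam, len(bam))  # dummy "count" for illustration

bams = ["sample_A.bam", "sample_B.bam", "sample_C.bam"]  # hypothetical names

# One worker per BAM file, analogous to forking jobs to the background in
# a bash for-loop (or handing each file to GNU parallel).
with ThreadPoolExecutor() as pool:
    results = dict(pool.map(count_sample, bams))
```

Each sample is processed independently, so this scales with the number of workers rather than running strictly one file after another.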
I'll see if I can fetch some test data and the pipeline to fix it this week.
Aah ok. Yeah, I tried it on the Cameroon dataset and there the rule took >1h, so that might be a good test dataset. I think the Snakemake method would be preferable, but it depends on how easy it is. I haven't got the time to look into it yet. But commenting out those variables, and maybe including the `-@` flag of `samtools`, will help a lot.
If you could look into it I'd be much obliged! Let me know if I can help.
I have not tested it yet, but I proposed an update in https://github.com/DennisSchmitz/Jovian/commit/97e62a38cbd85058bf7732d8f1e7114aa081cb6a (`mapped_reads` branch). Can you test this easily? Otherwise I'll see if I can do it later.
Tested it on a small demo set; after making the bash code compatible with Snakemake, it works. Tonight I'll try it with a larger dataset and afterwards merge it into `dev`.
Thanks!
After trying it on a larger dataset (the public Cameroon one) it gives the following error in rule `quantify_output`:
Done counting!===============================================] 100.0% - Reading files [ 189 / 189 ]
Traceback (most recent call last):
File "bin/quantify_profiles.py", line 767, in <module>
main()
File "bin/quantify_profiles.py", line 653, in main
classified_nrs = sum_superkingdoms(arguments.classified, arguments.mapped_reads)
File "bin/quantify_profiles.py", line 318, in sum_superkingdoms
[[ "mapped_reads" ]])
File "/mnt/scratch_dir/schmitzd/20191019_Jovian_Cameroon_dataset_KronaLCA/.snakemake/conda/87d87e66/lib/python3.7/site-packages/pandas/core/frame.py", line 2682, in __getitem__
return self._getitem_array(key)
File "/mnt/scratch_dir/schmitzd/20191019_Jovian_Cameroon_dataset_KronaLCA/.snakemake/conda/87d87e66/lib/python3.7/site-packages/pandas/core/frame.py", line 2726, in _getitem_array
indexer = self.loc._convert_to_indexer(key, axis=1)
File "/mnt/scratch_dir/schmitzd/20191019_Jovian_Cameroon_dataset_KronaLCA/.snakemake/conda/87d87e66/lib/python3.7/site-packages/pandas/core/indexing.py", line 1327, in _convert_to_indexer
.format(mask=objarr[mask]))
KeyError: "['mapped_reads'] not in index"
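The `KeyError` boils down to a table that never got a proper `mapped_reads` column. A small defensive sketch (hypothetical `load_counts` helper, stdlib only) that fails loudly with a clear message instead:

```python
import csv
import io

def load_counts(handle):
    """Parse a count table, refusing input whose header lacks 'mapped_reads'."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if "mapped_reads" not in header:
        # This is the situation behind the KeyError above: the expected
        # column never made it into the table.
        raise ValueError(f"'mapped_reads' column missing, got header: {header}")
    return list(reader)

good = io.StringIO("mapped_reads\tscaffold_name\n59\tNODE_1000_length_878_cov_0.407457\n")
rows = load_counts(good)
```

Validating the header up front would have pointed straight at the malformed concatenated file rather than failing deep inside pandas indexing.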
When I inspect the file `results/counts/Mapped_read_counts.tsv` I see that the file is "weird" (see below). It looks like there is one block without a filename in the first column, while later lines do have this structure. @samnooij, is this file structure intended?
mapped_reads scaffold_name
59 NODE_1000_length_878_cov_0.407457
55 NODE_1001_length_878_cov_0.259654
788 NODE_1002_length_877_cov_6.033333
140 NODE_1003_length_877_cov_1.030667
..........
..........
74 NODE_999_length_878_cov_0.548602
456 NODE_99_length_2032_cov_1.351706
6354 NODE_9_length_5644_cov_7.284394
results/counts/Mapped_read_counts-SRR7892427.tsv: 58 NODE_1000_length_1051_cov_0.247835
results/counts/Mapped_read_counts-SRR7892427.tsv: 794 NODE_1001_length_1050_cov_4.755146
results/counts/Mapped_read_counts-SRR7892427.tsv: 52 NODE_1002_length_1050_cov_0.380282
..........
..........
It's indeed the wonky file structure. If I make the file like below, it does work. I'll try to update it in the Snakefile next week.
awk 'FNR==1 && NR!=1 { while (/^mapped_reads/) getline; } 1 {print}' results/counts/Mapped_read_counts-*.tsv > results/counts/Mapped_read_counts.tsv
After making the file like that, no errors.
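A stdlib Python version of the same repair, for when the rule becomes a Python script: concatenate the per-sample tables while keeping only the first header line (the table contents below are shortened stand-ins):

```python
import io

def concatenate_counts(handles):
    """Concatenate count tables, dropping every repeated 'mapped_reads' header."""
    out_lines = []
    for i, handle in enumerate(handles):
        for j, line in enumerate(handle):
            if i > 0 and j == 0 and line.startswith("mapped_reads"):
                continue  # same job as the awk one-liner: skip repeated headers
            out_lines.append(line)
    return "".join(out_lines)

tables = [
    io.StringIO("mapped_reads\tscaffold_name\n59\tNODE_1000\n"),
    io.StringIO("mapped_reads\tscaffold_name\n58\tNODE_2000\n"),
]
merged = concatenate_counts(tables)
```

The result has exactly one header at the top, which is what the downstream `quantify_profiles.py` selection expects.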
@samnooij, if I understand the code correctly, you match on the assumption that the scaffold name is unique? While the odds are good that they are all unique, that's not a certainty. Can the filename be added to force it to be unique?
Indeed, I assume that there will be no scaffolds with the same number, length and coverage. It would be good to still include a check whether this is the case and make sure the code is robust even to unlikely events such as those. Even if that makes the script a bit more complicated. I am going to work on an update that does this.
And as you guessed, the file is supposed to be a two-column tab-separated table: the number of mapped reads and the scaffold name. I am going to add a third column with the sample/file name to exclude the possibility of duplicated scaffold names.
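Such a uniqueness check could be as simple as counting names before merging; a stdlib sketch with made-up rows:

```python
from collections import Counter

def duplicated_scaffolds(scaffold_names):
    """Return scaffold names that occur more than once."""
    counts = Counter(scaffold_names)
    return sorted(name for name, n in counts.items() if n > 1)

names = [
    "NODE_1_length_100_cov_1.000000",
    "NODE_2_length_200_cov_2.000000",
    "NODE_1_length_100_cov_1.000000",  # same name, but from another sample
]
dupes = duplicated_scaffolds(names)
```

If the returned list is non-empty, the script can either abort with a clear error or fall back to keying on (sample, scaffold) pairs.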
Perfect, thank you! Can you also include the above `awk` fix? I'm very tight on time, and if you're fixing the other bug anyway it would really help me if you also included that. If it's not possible, let me know :)
The double curly braces for Snakemake compatibility? I'm thinking about changing the rule to a new Python script to include duplicate checking together with the concatenation of all those separate tables. That seems more logical and probably easier to me than adding lines of shell commands. So that won't use the `awk` line anymore.
I will have a look at `bin/quantify_profiles.py` too, as this will then also have to include the file/sample names to prevent errors due to duplicated scaffold names.
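Merging on both columns can be sketched with a composite (scaffold_name, sample_name) key; all table contents here are hypothetical:

```python
def merge_on_scaffold_and_sample(counts, annotations):
    """Join two row lists on (scaffold_name, sample_name), so identical
    scaffold names from different samples can never collide."""
    lookup = {(a["scaffold_name"], a["sample_name"]): a for a in annotations}
    merged = []
    for row in counts:
        key = (row["scaffold_name"], row["sample_name"])
        if key in lookup:
            merged.append({**row, **lookup[key]})
    return merged

counts = [
    {"scaffold_name": "NODE_1", "sample_name": "SRR7892426", "mapped_reads": 59},
    {"scaffold_name": "NODE_1", "sample_name": "SRR7892427", "mapped_reads": 58},
]
annotations = [
    {"scaffold_name": "NODE_1", "sample_name": "SRR7892426", "superkingdom": "Viruses"},
]
merged = merge_on_scaffold_and_sample(counts, annotations)
```

With pandas this would be the same as merging on both columns at once, e.g. `merge(on=["scaffold_name", "sample_name"])`, so the two `NODE_1` rows stay distinct.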
Perfect! Whatever works best for you :)
I saw you pushed 2ea9819b and 1ede353a for this issue, is it in a testable state? Because I wanna test and release the stable IGVjs fix and then I can also take along this fix.
It is almost. Well, in fact I think it works already, yes. I am testing it too, with an additional update in `quantify_profiles.py` to merge tables on both scaffold_name and sample_name (to avoid errors with duplicated names). And that part works now. However, I am getting problems with `draw_heatmaps.py` again, as in #42.
I will make a push with my little new update and then you can test it too. I am hoping to get a grip on the heatmaps error... :\
P.S. I think I'm closing in on the `draw_heatmaps.py` bug... At one point it makes a table of the number of read pairs per sample. In my example, the sample names end up like "SRR7892426_", with an underscore attached. The subsequent merge of dataframes, by sample name, fails because of this. Therefore, the fix is probably making the sample column `_R1`/`_1` resistant by applying the same RegEx trick. (Getting a bit off-topic here...)
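That "RegEx trick" could look something like the sketch below; the exact pattern is my guess at what "resistant to `_R1`/`_1`" means, not the pipeline's actual expression:

```python
import re

def clean_sample_name(name):
    """Strip a trailing read-pair marker (_R1, _R2, _1, _2, or a bare _),
    so 'SRR7892426_' and 'SRR7892426' end up as the same sample name."""
    return re.sub(r"_(R?[12])?$", "", name)

cleaned = [clean_sample_name(n) for n in ["SRR7892426_", "SRR7892426_R1", "SRR7892427"]]
```

Normalising the names before merging would make the dataframe join succeed regardless of which suffix variant a sample carries.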
P.P.S This issue seems to have been solved in the master branch, but looks defective in the dev branch! I am now going to try the code from the master branch and see if that solves my current testing issue.
In https://github.com/DennisSchmitz/Jovian/commit/c7fe25e5d3d432b66e353e49efffcd3e3ce3c23e I took the `draw_heatmaps.py` script from the `dev` branch, and changed the lines for `def read_numbers(infile)` to the same function as it is on the `master` branch. With that, I was able to run a complete analysis.
In other words: it all works now. Please double check if it works for you too. Hopefully we can finally get that bug out and speed up the counting. ;)
I did a small test, it seems to work. I'll do a bigger test overnight and let you know tomorrow.
Also works on larger datasets; it's included in the `dev` branch and will be in the next release.
The count_mapped_reads rule is extremely slow. It's currently a bash script that is wrapped in Snakemake... It works with only 1 thread and processes all BAM files in an analysis. It does no splitting/parallelisation. This should be made parallel for each input BAM file via Snakemake, and we might use the `parallel` command to speed it up further.
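A rough sketch of how the per-BAM split could look as a Snakemake rule; every path, wildcard, and the exact samtools invocation here are assumptions for illustration, not the pipeline's real code:

```
# Hypothetical rule: one job per sample, so Snakemake schedules the BAM
# files in parallel instead of looping over them in a single bash script.
rule count_mapped_reads:
    input:
        bam="data/scaffolds_filtered/{sample}_sorted.bam"
    output:
        tsv="results/counts/Mapped_read_counts-{sample}.tsv"
    shell:
        "printf 'mapped_reads\\tscaffold_name\\n' > {output.tsv} && "
        "samtools idxstats {input.bam} "
        "| awk -v OFS='\\t' '$1 != \"*\" {{print $3, $1}}' >> {output.tsv}"
```

A separate rule can then concatenate the per-sample tables (keeping a single header) into the combined `Mapped_read_counts.tsv`.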