palomo11 closed this issue 4 years ago
It seems the FASTAs used for each initial run may have been different; can you check that the exact same FASTA was used for each run being combined?
Has this script successfully worked for you on other genomes thus far? Thanks, Alex
Thanks for the answer.
I'm quite sure the FASTA used was the same... This is the output from the profiling for each sample:
$ cat sample1-raw_sort_filter_info_MAG1.log
08-16 23:51 DEBUG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
08-16 23:51 DEBUG Command to run was: /home/projects/env_10000/people/name/scripts/soil_popgen/inStrain_lite/bin/inStrain_lite -p 28 -s 30 -c 0.96 -o sample1-raw_sort_filter_info_MAG1 --min_breadth_cov 0.5,5 sample1-raw_sort_filtered_sort.bam MAG1.fa
08-16 23:52 INFO Total read count, divided by 2 453112
08-16 23:52 INFO Mean read pair sequences length 157.88698430631808
08-16 23:52 INFO Total FASTA length 3728192
08-16 23:52 INFO Expected total coverage 19.189056581046366
08-16 23:52 INFO Mapped read pairs 226588 50%
08-16 23:52 INFO Median end-to-end insert length 84.0
08-16 23:52 INFO Read pairs which pass insert distance filters: 222574 49%
08-16 23:52 INFO Read pairs which also meet min_mapq of 2 226588 50%
08-16 23:52 INFO Read pairs which also pass final read pair PID >0.96% 226524 49%
08-16 23:52 INFO Final expected coverage 9.593173107233854
08-16 23:52 INFO Finalizing and saving
08-16 23:52 INFO Mean coverage 5.708904745249172
08-16 23:52 INFO Mean breadth 0.9003111964190685
08-16 23:52 INFO Breadth above minimum coverage 0.5071997364942578
08-16 23:52 INFO Passes breadth-cov filter. Will output tables
$ cat sample2-raw_sort_filter_info_MAG1.log
08-16 23:58 DEBUG !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
08-16 23:58 DEBUG Command to run was: /home/projects/env_10000/people/name/scripts/soil_popgen/inStrain_lite/bin/inStrain_lite -p 28 -s 30 -c 0.96 -o sample2-raw_sort_filter_info_MAG1 --min_breadth_cov 0.5,5 sample2-raw_sort_filtered_sort.bam MAG1.fa
08-16 23:59 INFO Total read count, divided by 2 822294
08-16 23:59 INFO Mean read pair sequences length 228.76555678713453
08-16 23:59 INFO Total FASTA length 3728192
08-16 23:59 INFO Expected total coverage 50.45677496028102
08-16 23:59 INFO Mapped read pairs 411396 50%
08-16 23:59 INFO Median end-to-end insert length 147.0
08-16 23:59 INFO Read pairs which pass insert distance filters: 407184 49%
08-16 23:59 INFO Read pairs which also meet min_mapq of 2 411396 50%
08-16 23:59 INFO Read pairs which also pass final read pair PID >0.96% 410883 49%
08-16 23:59 INFO Final expected coverage 25.212188178443654
08-17 00:00 INFO Finalizing and saving
08-17 00:00 INFO Mean coverage 17.64156084236005
08-17 00:00 INFO Mean breadth 0.9833093359998627
08-17 00:00 INFO Breadth above minimum coverage 0.9352753291675965
08-17 00:00 INFO Passes breadth-cov filter. Will output tables
There are only 7 minutes of difference between the profilings, and the Total FASTA length is the same.
The same happens with other MAGs and samples... So far I haven't managed to make it work, although I have not tried all possible combinations...
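Beyond comparing total lengths, a checksum pins down whether the two runs really saw byte-identical FASTAs. A minimal sketch (the MAG1.fa filename is taken from the commands above; the helper name is mine):

```python
import hashlib

def fasta_md5(path):
    """MD5 digest of a file, read in 1 MiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# Run this against the FASTA passed to each profiling run;
# identical digests mean byte-identical files.
# print(fasta_md5("MAG1.fa"))
```

If the digests match, the reference can be ruled out as the source of the mismatch.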
Hm, it is unclear to me why it's not working... You might want to mess around with the script; the additional scripts were really meant to be dataset-specific.
To get around this, you could try creating combined BAMs (either by merging the existing BAMs of each sample or by mapping both sets of reads in one run) and then running the pipeline on that combined BAM. I also do that in my own work sometimes.
Note that combine_samples.py will also not be able to calculate linkage in the resulting files... so for a combined linkage analysis, definitely try creating combined BAMs.
OK. Actually, the combine_filter_bams.py script worked well when I combined the BAMs from all samples for each species. I guess I can do the same for my replicates or my "blocks" and then do the profiling with inStrain_lite to get nucleotide diversity and so on.
For the Fst, if I want to compare two "blocks", I have to use the profile that I got after running combine_filter_bams.py on all the samples from one "block" vs the profile of the other "block". Is that right?
Thanks in advance!
OK! And yes, that's how it should work. Thanks! Alex
Hi Alex,
I tried to run it in the way I proposed, and you told me it should be OK, but it is still giving an error. This is what I have done:
python fst.py -g MAG1_predicted_CDS.fna -o sampleAB_sampleCD MAG1:MAG1_filtered_sort_sampleAB MAG1:MAG1_filtered_sort_sampleCD
This is the error message:
loading MAG1:MAG1_filtered_sort_sampleAB
loading MAG1:MAG1_filtered_sort_sampleCD
Traceback (most recent call last):
File "fst.py", line 159, in <module>
main(args)
File "fst.py", line 75, in main
s_final.counts_table[scaf_counter] += scaf
ValueError: operands could not be broadcast together with shapes (212759,4) (17788,4) (212759,4)
This is the profile for MAG1:MAG1_filtered_sort_sampleAB
10-02 12:58 INFO Total read count, divided by 2 10424312
10-02 12:58 INFO Mean read pair sequences length 209.04316153992235
10-02 12:58 INFO **Total FASTA length 3728192**
10-02 12:58 INFO Expected total coverage 584.5007814400522
10-02 12:58 INFO Mapped read pairs 5218419 50%
10-02 12:58 INFO Median end-to-end insert length 132.0
10-02 12:58 INFO Read pairs which pass insert distance filters: 5201088 49%
10-02 12:58 INFO Read pairs which also meet min_mapq of 2 5218419 50%
10-02 12:58 INFO Read pairs which also pass final read pair PID >0.96% 5205555 49%
10-02 12:58 INFO Final expected coverage 291.8802665661936
10-02 13:12 INFO Finalizing and saving
10-02 13:12 INFO Mean coverage 193.86509251669443
10-02 13:12 INFO Mean breadth 0.9903339205706144
10-02 13:12 INFO Breadth above minimum coverage 1.0
10-02 13:12 INFO Passes breadth-cov filter. Will output tables
This is the profile for MAG1:MAG1_filtered_sort_sampleCD
10-02 12:54 INFO Total read count, divided by 2 1476438
10-02 12:54 INFO Mean read pair sequences length 229.6613776225952
10-02 12:54 INFO **Total FASTA length 3728192**
10-02 12:54 INFO Expected total coverage 90.95046206159695
10-02 12:54 INFO Mapped read pairs 739201 50%
10-02 12:54 INFO Median end-to-end insert length 161.0
10-02 12:54 INFO Read pairs which pass insert distance filters: 733594 49%
10-02 12:54 INFO Read pairs which also meet min_mapq of 2 739201 50%
10-02 12:54 INFO Read pairs which also pass final read pair PID >0.96% 737209 49%
10-02 12:54 INFO Final expected coverage 45.41301374386721
10-02 12:57 INFO Finalizing and saving
10-02 12:57 INFO Mean coverage 33.051611880504005
10-02 12:57 INFO Mean breadth 0.9766463207903455
10-02 12:57 INFO Breadth above minimum coverage 1.0
10-02 12:57 INFO Passes breadth-cov filter. Will output tables
Any idea about what is going on? Thanks! Alex
Hey, I wish I could tell what was going on here. That error seems to indicate that there is something wrong with the sizes of the objects... I'm not sure what it means to have three shapes after a concatenation of two arrays; that is bewildering! I'd suggest playing around with lines 72-75 in that script... for example, print the scaffold name to see which scaffold the error is occurring on.
Maybe it is also a numpy version issue? I am on numpy 1.17.1.
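On the three-shapes question: NumPy's in-place addition (`a += b` is `np.add(a, b, out=a)`), when the broadcast fails, reports the shapes of both inputs and of the output buffer, which is why the first and third shapes in the traceback match. A minimal reproduction with the shapes from the error above:

```python
import numpy as np

# Counts table already in s_final vs. the table being added for one profile;
# first dimensions (212759 vs 17788) differ and neither is 1, so the
# broadcast fails.
big = np.zeros((212759, 4), dtype=int)
small = np.zeros((17788, 4), dtype=int)
try:
    big += small  # np.add(big, small, out=big)
except ValueError as err:
    print(err)
# -> operands could not be broadcast together with shapes
#    (212759,4) (17788,4) (212759,4)
```

So the third shape is not a mystery third array; it is the `out=` operand of the in-place add.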
Hi sorry for late reply.
Thank you for your suggestion. I have added a couple of lines to the script to print the scaffolds names of both profiles and they seem to be the same:
loading MAG1:MAG1_filtered_sort_sampleA_sampleB MAG1_15 MAG1_26 MAG1_27 MAG1_24 MAG1_18 MAG1_6 MAG1_4 MAG1_9 MAG1_8 MAG1_25 MAG1_10 MAG1_2 MAG1_5 MAG1_17 MAG1_20 MAG1_16 MAG1_7 MAG1_22 MAG1_19 MAG1_11 MAG1_1 MAG1_12 MAG1_23 MAG1_3 MAG1_14 MAG1_13 MAG1_21
loading MAG1:MAG1_filtered_sort_sampleC_sampleD MAG1_4 MAG1_7 MAG1_10 MAG1_5 MAG1_19 MAG1_20 MAG1_26 MAG1_14 MAG1_21 MAG1_18 MAG1_2 MAG1_22 MAG1_13 MAG1_3 MAG1_24 MAG1_1 MAG1_25 MAG1_23 MAG1_17 MAG1_6 MAG1_15 MAG1_11 MAG1_12 MAG1_16 MAG1_9 MAG1_8 MAG1_27
Is there anything else I can do to track down the error? How can I detect which scaffold the error is occurring on?
My numpy version is 1.17.2
Thanks in advance.
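A way to answer the "which scaffold" question is to walk both profiles in parallel and report every position where the names or table shapes disagree. This is only a sketch: the `scaffold_list` and `counts_table` attribute names are assumptions about how fst.py stores its profiles, not taken from the script itself.

```python
def find_shape_mismatches(s1, s2):
    """Return (name1, shape1, name2, shape2) for every position where
    two profiles disagree on scaffold name or counts-table shape."""
    mismatches = []
    for name1, tab1, name2, tab2 in zip(
            s1.scaffold_list, s1.counts_table,
            s2.scaffold_list, s2.counts_table):
        if name1 != name2 or tab1.shape != tab2.shape:
            mismatches.append((name1, tab1.shape, name2, tab2.shape))
    return mismatches
```

Note that the scaffold lists printed above contain the same names but in different orders, so a positional walk like this would flag nearly every entry, which is itself the diagnosis.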
Finally I managed to make it work. I had to add the following lines to fst.py (I guess it is the same for combine_samples.py):
s1.counts_table = sorted(s1.counts_table, key=len)
s2.counts_table = sorted(s2.counts_table, key=len)
s_final.counts_table = sorted(s_final.counts_table, key=len)
It seems the order of the count tables was not the same in every profile.
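Sorting by length works here, but it is fragile: two scaffolds of equal length could still end up swapped between profiles. A sturdier variant is to reorder one profile's tables to follow the other's scaffold-name order; the sketch below assumes parallel lists of names and tables, which may not match how fst.py actually stores them.

```python
def align_counts_tables(names1, names2, tables2):
    """Reorder tables2 (parallel to names2) to follow the order of names1.

    Assumes both name lists contain the same scaffold names exactly once.
    """
    by_name = dict(zip(names2, tables2))
    return [by_name[name] for name in names1]
```

After aligning, position i in both profiles refers to the same scaffold, so the per-scaffold `+=` in fst.py operates on equal-shaped arrays.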
It's great that you figured this out! We will make a change to that in the development version. Thanks! Alex
Hi,
After doing the per-sample profiling, I'm trying to aggregate specific outputs using combine_samples.py.
For one species, when trying to combine two outputs from replicate samples, I get the following error:
The same has happened for another pair of samples.