alexcritschristoph / soil_popgen

Reproducible scripts and notebooks for 2019 paper on population genetics in metagenomes
GNU General Public License v3.0
14 stars 0 forks source link

error combine_samples #2

Closed palomo11 closed 4 years ago

palomo11 commented 5 years ago

Hi,

After doing the per-sample profiling I'm trying to aggregate specific outputs by using combine_samples.py.

For 1 species, when trying to combine two outputs from replicate samples, I get the following error:

$ python /home/projects/env_10000/people/name/scripts/soil_popgen/inStrain_lite/inStrain_lite/combine_samples.py sample1-raw_sort_filter_info_MAG1 sample2-raw_sort_filter_info_MAG1 -o sample1_sample2
loading sample1-raw_sort_filter_info_MAG1
loading sample2-raw_sort_filter_info_MAG1
Traceback (most recent call last):
  File "/home/projects/env_10000/people/name/scripts/soil_popgen/inStrain_lite/inStrain_lite/combine_samples.py", line 102, in <module>
    main(args)
  File "/home/projects/env_10000/people/name/scripts/soil_popgen/inStrain_lite/inStrain_lite/combine_samples.py", line 41, in main
    s_final.counts_table[scaf_counter] += scaf
ValueError: operands could not be broadcast together with shapes (17788,4) (5276,4) (17788,4) 

It has happened the same for another per of samples.

alexcritschristoph commented 5 years ago

This seems like the FASTAs used for each initial run may have been different - can you check to see that the exact same FASTA was used for each run being combined?

Has this script successfully worked for you on other genomes thus far? Thanks, Alex

palomo11 commented 5 years ago

Thanks for the answer.

I'm quite sure the FASTA used was the same... This is the output from the profiling for each sample:

$ cat sample1-raw_sort_filter_info_MAG1.log
08-16 23:51 DEBUG    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
08-16 23:51 DEBUG    Command to run was: /home/projects/env_10000/people/name/scripts/soil_popgen/inStrain_lite/bin/inStrain_lite -p 28 -s 30 -c 0.96 -o sample1-raw_sort_filter_info_MAG1 --min_breadth_cov 0.5,5 sample1-raw_sort_filtered_sort.bam MAG1.fa

08-16 23:52 INFO     Total read count, divided by 2 453112
08-16 23:52 INFO     Mean read pair sequences length    157.88698430631808
08-16 23:52 INFO     Total FASTA length 3728192
08-16 23:52 INFO     Expected total coverage    19.189056581046366
08-16 23:52 INFO     Mapped read pairs  226588  50%
08-16 23:52 INFO     Median end-to-end insert length    84.0
08-16 23:52 INFO     Read pairs which pass insert distance filters: 222574  49%
08-16 23:52 INFO     Read pairs which also meet min_mapq of 2   226588  50%
08-16 23:52 INFO     Read pairs which also pass final read pair PID >0.96%  226524  49%
08-16 23:52 INFO     Final expected coverage    9.593173107233854
08-16 23:52 INFO     Finalizing and saving
08-16 23:52 INFO     Mean coverage  5.708904745249172
08-16 23:52 INFO     Mean breadth   0.9003111964190685
08-16 23:52 INFO     Breadth above minimum coverage 0.5071997364942578
08-16 23:52 INFO     Passes breadth-cov filter. Will output tables
$ cat sample2-raw_sort_filter_info_MAG1.log
08-16 23:58 DEBUG    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
08-16 23:58 DEBUG    Command to run was: /home/projects/env_10000/people/name/scripts/soil_popgen/inStrain_lite/bin/inStrain_lite -p 28 -s 30 -c 0.96 -o sample2-raw_sort_filter_info_MAG1 --min_breadth_cov 0.5,5 sample2-raw_sort_filtered_sort.bam MAG1.fa

08-16 23:59 INFO     Total read count, divided by 2 822294
08-16 23:59 INFO     Mean read pair sequences length    228.76555678713453
08-16 23:59 INFO     Total FASTA length 3728192
08-16 23:59 INFO     Expected total coverage    50.45677496028102
08-16 23:59 INFO     Mapped read pairs  411396  50%
08-16 23:59 INFO     Median end-to-end insert length    147.0
08-16 23:59 INFO     Read pairs which pass insert distance filters: 407184  49%
08-16 23:59 INFO     Read pairs which also meet min_mapq of 2   411396  50%
08-16 23:59 INFO     Read pairs which also pass final read pair PID >0.96%  410883  49%
08-16 23:59 INFO     Final expected coverage    25.212188178443654
08-17 00:00 INFO     Finalizing and saving
08-17 00:00 INFO     Mean coverage  17.64156084236005
08-17 00:00 INFO     Mean breadth   0.9833093359998627
08-17 00:00 INFO     Breadth above minimum coverage 0.9352753291675965
08-17 00:00 INFO     Passes breadth-cov filter. Will output tables

There is only 7 minutes of different between the profilings, and the Total FASTA length is the same.

It happens the same with other MAGs and samples... So far I haven't managed to make it worked. Although I have not tried all possible combinations...

alexcritschristoph commented 5 years ago

Hm, it is unclear to me as to why it's not working... you might want to mess around with the script, the additional scripts were really meant to be dataset specific. To get around this, you could try creating combined BAMs (either by merging existing BAMs of each sample or mapping both sets of reads in one run) and then running the pipeline on that combined BAM. I also do that for my work sometimes. Note that the combine_samples.py will also not be able to calculate linkage in resulting files... so for combined linkage analysis definitely try creating combined BAMs

palomo11 commented 5 years ago

OK. Actually, the combine_filter_bams.py script worked well when I combined the BAMs from all samples for each species. I guess I can do the same for my replicates or my "blocks" and then do the profiling with inStrainlite to get nucleotide diversity and so on.

For the Fst, if I want two compare between two "blocks" I have to use the profile that I got after using combine_filter_bams.py with all the samples from one "block" vs the profile of the other "block". Is that right?

Thanks in advance!

alexcritschristoph commented 5 years ago

OK! And yes, that's how it should work. Thanks! Alex

palomo11 commented 4 years ago

Hi Alex,

I tried to run it in the way I proposed and you told me it should be OK but it is still giving an error. So what I have done is:

This is the error message:

loading MAG1:MAG1_filtered_sort_sampleAB
loading MAG1:MAG1_filtered_sort_sampleCD
Traceback (most recent call last):
  File "fst.py", line 159, in <module>
    main(args)
  File "fst.py", line 75, in main
    s_final.counts_table[scaf_counter] += scaf
ValueError: operands could not be broadcast together with shapes (212759,4) (17788,4) (212759,4) 

This is the profile for MAG1:MAG1_filtered_sort_sampleAB

10-02 12:58 INFO     Total read count, divided by 2 10424312
10-02 12:58 INFO     Mean read pair sequences length    209.04316153992235
10-02 12:58 INFO     **Total FASTA length   3728192**
10-02 12:58 INFO     Expected total coverage    584.5007814400522
10-02 12:58 INFO     Mapped read pairs  5218419 50%
10-02 12:58 INFO     Median end-to-end insert length    132.0
10-02 12:58 INFO     Read pairs which pass insert distance filters: 5201088 49%
10-02 12:58 INFO     Read pairs which also meet min_mapq of 2   5218419 50%
10-02 12:58 INFO     Read pairs which also pass final read pair PID >0.96%  5205555 49%
10-02 12:58 INFO     Final expected coverage    291.8802665661936
10-02 13:12 INFO     Finalizing and saving
10-02 13:12 INFO     Mean coverage  193.86509251669443
10-02 13:12 INFO     Mean breadth   0.9903339205706144
10-02 13:12 INFO     Breadth above minimum coverage 1.0
10-02 13:12 INFO     Passes breadth-cov filter. Will output tables

This is the profile for MAG1:MAG1_filtered_sort_sampleCD

10-02 12:54 INFO     Total read count, divided by 2 1476438
10-02 12:54 INFO     Mean read pair sequences length    229.6613776225952
10-02 12:54 INFO     **Total FASTA length   3728192**
10-02 12:54 INFO     Expected total coverage    90.95046206159695
10-02 12:54 INFO     Mapped read pairs  739201  50%
10-02 12:54 INFO     Median end-to-end insert length    161.0
10-02 12:54 INFO     Read pairs which pass insert distance filters: 733594  49%
10-02 12:54 INFO     Read pairs which also meet min_mapq of 2   739201  50%
10-02 12:54 INFO     Read pairs which also pass final read pair PID >0.96%  737209  49%
10-02 12:54 INFO     Final expected coverage    45.41301374386721
10-02 12:57 INFO     Finalizing and saving
10-02 12:57 INFO     Mean coverage  33.051611880504005
10-02 12:57 INFO     Mean breadth   0.9766463207903455
10-02 12:57 INFO     Breadth above minimum coverage 1.0
10-02 12:57 INFO     Passes breadth-cov filter. Will output tables

Any idea about what is going on? Thanks! Alex

alexcritschristoph commented 4 years ago

Hey, I wish I could tell what was going on here. That error seems to indicate that there is something wrong with the sizes of the objects ... I'm not sure what it means to have three shapes after a concatenation of two arrays, that is bewildering! I'd suggest playing around with lines 72-75 in that script... for example print the scaffold name to see which scaffold this error is occurring on.

Maybe it is also a numpy version issue? I am on numpy 1.17.1.

palomo11 commented 4 years ago

Hi sorry for late reply.

Thank you for your suggestion. I have added a couple of lines to the script to print the scaffolds names of both profiles and they seem to be the same:

loading MAG1:MAG1_filtered_sort_sampleA_sampleB MAG1_15 MAG1_26 MAG1_27 MAG1_24 MAG1_18 MAG1_6 MAG1_4 MAG1_9 MAG1_8 MAG1_25 MAG1_10 MAG1_2 MAG1_5 MAG1_17 MAG1_20 MAG1_16 MAG1_7 MAG1_22 MAG1_19 MAG1_11 MAG1_1 MAG1_12 MAG1_23 MAG1_3 MAG1_14 MAG1_13 MAG1_21 loading MAG1:MAG1_filtered_sort_sampleC_sampleD MAG1_4 MAG1_7 MAG1_10 MAG1_5 MAG1_19 MAG1_20 MAG1_26 MAG1_14 MAG1_21 MAG1_18 MAG1_2 MAG1_22 MAG1_13 MAG1_3 MAG1_24 MAG1_1 MAG1_25 MAG1_23 MAG1_17 MAG1_6 MAG1_15 MAG1_11 MAG1_12 MAG1_16 MAG1_9 MAG1_8 MAG1_27

Is there any other thing I can do to check where is the error? How can I detect in which scaffold the error is occurring?

My numpy version is 1.71.2

Thanks in advance.

palomo11 commented 4 years ago

Finally I managed to make it work. I had to add the following lines to fst.py(I guess is the same for combine_samples.py):

s1.counts_table = sorted(s1.counts_table, key=len) s2.counts_table = sorted(s2.counts_table, key=len) s_final.counts_table = sorted(s_final.counts_table, key=len)

It seems the order was not the same for every count table.

alexcritschristoph commented 4 years ago

It's great that you figured this out! We will make a change to that in the development version. Thanks! Alex