ctlab / GADMA

Genetic Algorithm for Demographic Model Analysis
Other
46 stars 14 forks source link

bootstrap for linked SNPs #80

Closed mariels closed 1 year ago

mariels commented 1 year ago

Hello,

I am a bit confused about how to work with the bootstrap data.

I am generating a 3dsfs from whole genome resequencing data in angsd/winsfs in both the dadi format (using realsfs dadi) and moments format using a conversion script.

Would it be ok to generate the bootstrap data using the command below in angsd: realSFS pop.saf.idx pop.saf.idx pop.saf.idx -bootstrap 100

Then convert all the boostraps to the moments format and include them in the directory for bootstrap data?

Would you recommend this or generating the 3dsfs from a set of unlinked SNPs or selecting a set of unlinked SNP in the dadi SNP file? or would either be fine?

Thank you,

Marie

noscode commented 1 year ago

Dear Marie,

Thank you for your interest, I am glad to help. Bootstrap is a resampling method. If there are n samples, it is possible to create anther (but very similar) set of n samples using random choice (with repeats) from the original set of samples.

For demographic inference the way of bootstrap depends on the information about used SNPs. If SFS was built using unlinked SNP's then bootstrap is simple,one just resample set of SNP's. If you are using dadi or moments it could be done easily for your SFS using resample function (if I remember the name correctly).

If data has linked SNP's, then one needs to performed block-bootstrap, which is usual bootstrap but over the regions of linkage. One need to divide data into regions of linkage and make new samples of those regions. I have found some explanations and one way to do it in the documentation of dadi. Sometimes regions of linkage could be identified more precisely, for example as genes, but in that case you will need to write your own script to build bootstrap data.

Would it be ok to generate the bootstrap data using the command below in angsd: realSFS pop.saf.idx pop.saf.idx pop.saf.idx -bootstrap 100 I have tried to find more information about what kind of bootstrap realSFS is doing here. There is not a lot of information about it, but it looks like it is usual bootstrap over SNP's. If it is so and you have linked loci then it will not work correctly.

If you have a lot of data then I would recommend to use only unlinked SNPs to build SFS as it will be much easier to build bootstrap data. If not then I am afraid you have to apply block-bootstrap and it would be more difficult. Please let me know if it makes sense and what is your solution as then I can provide more assistance.

Best regards, Ekaterina

mariels commented 1 year ago

Dear Ekaterina,

Thank you very much for your answer and help.

I have downsampled my dadi input file to get unlinked SNPs (based on LD decay curve). I have tried to run GADMA but it stops with the following error message: SyntaxError: Construction of data_dict failed: 'Allele2' is not in list

I will copy below this message the full message.

Allele2 is in the header of the input file, for example, the first lines look as follow: REF OUT Allele1 West East NEG Allele2 West East NEG Gene Position cag CGG a 0 0 0 T 16 14 18 NW_021703766.1 18085

Do you know what could be the issue?

Thank you very much, Best regards,

Marie

Data reading UserWarning: Time for generation is not set. All times will be ingenerations (output and pictures). (/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/cli/settings_storage.py:1390) UserWarning: Parameters will be in genetic units (Relative parameters). Engines dadi and moments require mutation rate and sequence length for unit translation (/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/cli/settings_storage.py:1411) UserWarning: Code for momentsLD will not be generated as: VCF input data is required. (/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/cli/settings_storage.py:1536) UserWarning: Code for demes engine will not be generated as ancestral size will be missed in the dem. model. The following options should be set to enable it: Ancestral size as parameter: True (got self.ancestral_size_as_parameter) or Mutation rate (got None) Sequence length (got None) or Theta0 (got None) (/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/cli/settings_storage.py:1546) Traceback (most recent call last): File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/engines/dadi_moments_common.py", line 502, in _read_data_snp_type dd = module.Misc.make_data_dict(data_holder.filename) File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/moments/Misc.py", line 286, in make_data_dict allele2_index = header.split().index("Allele2") ValueError: 'Allele2' is not in list

During handling of the above exception, another exception occurred:

Traceback (most recent call last): File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/bin/gadma", line 10, in sys.exit(main()) File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/core/core.py", line 69, in main settings_storage.read_data() File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/cli/settings_storage.py", line 916, in read_data engine.data = self.data_holder File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/engines/engine.py", line 209, in data self.set_data(new_data) File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/engines/engine.py", line 191, in set_data self.inner_data = self.read_data(data) File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/engines/engine.py", line 121, in read_data return cls._read_data(data_holder) File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/engines/dadi_moments_common.py", line 55, in _read_data data = read_sfs_data(cls.base_module, data_holder) File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/engines/dadi_moments_common.py", line 630, in read_sfs_data return _read_data_snp_type(module, data_holder) File "/projects/mjolnir1/apps/conda/gadma-2.0.0rc25/lib/python3.8/site-packages/gadma/engines/dadi_moments_common.py", line 504, in _read_data_snp_type raise SyntaxError("Construction of data_dict failed: " + str(e)) SyntaxError: Construction of data_dict failed: 'Allele2' is not in list

noscode commented 1 year ago

Dear Marie,

Okay, perfect. I am not sure, but it looks like dadi struggles to read file. The file although looks fine for me. Check that line with "Allele2" is the first line in file that does not have # at the beginning.

Best regards, Ekaterina

mariels commented 1 year ago

Dear Ekaterina,

Thank you for your answer and sorry for the delay, I was away.

The first line starts as follow, so I think it looks ok: REF OUT Allele1 West East NEG Allele2 West East NEG Gene Position

I have used this perl script to transform the angsd format to dadi: https://github.com/z0on/2bRAD_denovo/blob/master/realsfs2dadi.pl

And this program to thin the file: https://github.com/z0on/2bRAD_denovo/blob/master/thinner_dadi.pl

I will ask the author of the perl scripts and on the dadi program help and I'll write here if I find any solution.

Thanks, Best wishes,

marie

noscode commented 1 year ago

Dear Marie,

It would be great to check if there will be the same error if you use an intermediate file - file in dadi format that is not thinned. If there is no problem with this file, then it will indicate that the second script you used caused the problem.

There is another script that I know of that parse angsd output: https://bitbucket.org/simongravel/moments/src/main/examples/fs_from_angsd/parse_angsd_output.py. Maybe you can try this one.

Best regards, Ekaterina

mariels commented 1 year ago

Dear Ekaterina,

Thank you very much.

I did try to run it without the thinning and got the same error.

I saw this other conversion script but my issue then was how to do the thinning. I'll try to see if it runs with that input file. If that works, then maybe I could do the thinning before estimating the sfs.

Best wishes,

Marie