ZhiGroup / RaPID

MIT License
30 stars 6 forks source link

unphased data #2

Open savitakartik opened 7 years ago

savitakartik commented 7 years ago

Curious to know if/when you might support raw genotypes in RaPID?

Thanks!

zhizhid commented 7 years ago

It is difficult to give a timeline as this is a new research question. For now we have to stick with phased data.

On Mar 2, 2017, at 3:37 AM, savitakartik notifications@github.com wrote:

Curious to know if/when you might support raw genotypes in RaPID?

Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/ZhiGroup/RaPID/issues/2, or mute the thread https://github.com/notifications/unsubscribe-auth/AJZtEpdBJGOWSG8xBSzTkr1X3APLD3Ouks5rho3pgaJpZM4MQwNr.

laiyy1 commented 6 years ago

I want to know if the input file of vcf which has the numeric genotype of "-1", how to deal with the question?

Thanks.

ardalan-n commented 6 years ago

The current implementation only accepts bi-allelic sites (0's and 1's). If there is any other value (e.g. -1), the corresponding site is skipped.

laiyy1 commented 6 years ago

First, thanks your reply very much. Second, I met a question, but I had to ask you. What's the meaning of the title of genetic_maps, like Rate(cM/Mb) and Map(cM), how to calculate the Rate(cM/Mb) and Map(cM)? Look forward to your reply.

ardalan-n commented 6 years ago

Rate(cM/Mb) denotes the rate of recombination at each genomic location and Map(cM) denotes the corresponding genetic location. Please note that RaPID (v.1.6) uses the genetic location and the Rate is ignored. The provided genetic map files were generated by Phase II HapMap ( ftp://ftp.hapmap.org/hapmap/recombination/2011-01_phaseII_B37/genetic_map_HapMapII_GRCh37.tar.gz)

laiyy1 commented 6 years ago

Is RaPID software can apply to other species, such as maize, rice and so on ? Look forward to your reply. Thanks.

zhizhid commented 6 years ago

I would say so. But it will need some adjusting. So far we are only focused on humans.

On Jul 31, 2018, at 3:10 AM, laiyy1 notifications@github.com wrote:

Is RaPID software can apply to other species, such as maize, rice and so on ? Look forward to your reply. Thanks.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ZhiGroup/RaPID/issues/2#issuecomment-409134579, or mute the thread https://github.com/notifications/unsubscribe-auth/AJZtEvFJx9SC6gvZYhuk_UGkigZBiRr4ks5uMBD5gaJpZM4MQwNr.

Tuisto59 commented 5 years ago

Hi I would like if this tool can be applied to find IBD in raw data genetic test seeling by DNA company such as 23andme , AncestryDNA, FamilyTreeDNA, LivingDNA, MyHeritage, the file are like table with snp id, chromosome, position, genotype, if correctly converted into the accepted input format (like VCF or BED etc...) Thanks for our repply ! Do you think using blast is a good solution ?

zhizhid commented 5 years ago

Hi Yoan,

I am not sure if your question has been answered. At this moment RaPID only takes VCF format input. But I guess you can find third-party scripts to convert between these formats.

Thanks,

Degui

On May 8, 2019, at 3:46 PM, BOUZIN Yoan notifications@github.com wrote:

Hi I would like if this tool can be applied to find IBD in raw data genetic test seeling by DNA company such as 23andme , AncestryDNA, FamilyTreeDNA, LivingDNA, MyHeritage, the file are like table with snp id, chromosome, position, genotype, if correctly converted into the accepted input format (like VCF or BED etc...) Thanks for our repply !

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ZhiGroup/RaPID/issues/2#issuecomment-490644783, or mute the thread https://github.com/notifications/unsubscribe-auth/ACLG2EUSSIEZT7WEPGDXSM3PUM33BANCNFSM4DCDANVQ.

zhizhid commented 5 years ago

Hi Yoan,

I think you might have gotten confused loose sequencing alignment (like blast) and anchored sequencing matching (like RaPID). They are different computational problems.

Here in this repo we are only fielding questions related to RaPID, due to our limited capacity.

Best,

Degui

On Jun 3, 2019, at 5:20 PM, BOUZIN Yoan notifications@github.com wrote:

Hi thanks for the repply, I was thinking maybe my email was lost somewhere or maybe something like that.

I actually work with that solution :

23andMe, MyHeritage, LivingDNA, AncestryDNA, FamillyTreeDNA, make the possibility to download their raw data.

All of them have common format such as comma separated values , or tabulated separated values, with the rsid, chromosome and position, sometime one or two columns for the genotype (allele 1 and allele2), and few description line in the begenning. I devellop a parser that recognize these format.

Also genetic test can have different number and content of SNP, making them difficult to be compared, for the moment I dont use imputation to guess the genotype for each not common SNP between two test (like MyHeritage do). The difficulty are that all the test, for all the company, for all version and year are sometime very different between few to many hundred thousand of SNPs between test, making poor « intersection » of common SNPs.

My skill in bioinformatics and my instinct follow the BLAST and FASTA methodologie because BLAST are very fast to analyze huge amount of sequences. In fact we want in my company to be able to analyse millions of test and show the results in term of minutes... we hope...

I normalize the genotype of all the SNP into alphabetic order, like genotype TA for the two allele of one SNP is converted to be AT in alphabetic order. Because actually we can't guess wich are from wich allele.

After I make a file in FASTA format, with two sequence (one for each allele). The sequence is constructed by the concatenation of all the genotype of the first normalized allele and the second allele by concatenation of the 2nd letter of the genotype for the second allele.

Before normalisation

SNP

Genotype1

Gennotype2

rs1

A

C

rs2

T

C

rs3

A

A

rs4

G

C

rs5

T

G

After normalisation

SNP

Genotype1

Gennotype2

rs1

A

C

rs2

C

T

rs3

A

A

rs4

C

G

rs5

G

T

After tranform to FASTA

allele1

ACACG

allele2

CTAGT

Like that, we can make a BLASTn database with makeblastdb for this file.

Each test will be compared to all the other, and for each comparison, a FASTA with the common SNP between each kit will be created.

A single BLAST database will be constructed for each file, and a virtual consensus database (blastdb_aliastool) will be created, enable the possibility to dont recreate all the BLASTdb if all the sequences where inside and make also the possibility to choose, and parallelize.

So a FASTA will be compared to all the other, the output will be the common BLAST output, with the help of biopython I use the xml format to parse it.

An allele will be compared twice with the allele1 and the allele2 of all the other kit.

A BLAST output will be with ungapped option

so given another test (n°2)

allele1

ATGAA

allele2

CTGCA

comparison will be :

Test1/test2

Allele1 test1

Allele2 test1

Allele1 test2

ACACG

| ==> 10000

ATGAA

CTAGT

| => 01000

ATGAA

Allele2 test2

ACACG

| => 00010

CTGCA

CTAGT

|| => 11000

CTGCA

By replacing pipe by 1 and no identity by 0, and summing the Test1(allele1 -allele2) x Test2 (allele1 – allele) I get :

10000

00010

01000

11000


22010

So we have an IDB with the 2 consecutive first SNP

With the help of the lineage python package and the HapMap Phase II (build37) we can compute the centimorgan

Actually I make a artificial pedigree tree with 50000 random genetic test with all the random probability of the litterature.

So with BLAST and with this method I can detect consecutive segment from two genetic test.

Because I am not a specialist in genetic of population my best is phylogenetics, genomics and proteomics. Can you comment my methodology ? Maybe point out the weekness, and the strenght ?

I was wondering if RaPID will be fastest like BLAST ? They is also HS-BLAST using BWT transform, I dont know if its actually well maintened...

What is exactly also plink format ? Can you tell me if with fasta of only with information such as « rsid, chromosome, position, genotype » I have enough information to make also VCF file to use RaPID with my actual genetic test ?

Thanks you for your help !

Yoan BOUZIN

Le lun. 3 juin 2019 à 23:19, zhizhid notifications@github.com a écrit :

Hi Yoan,

I am not sure if your question has been answered. At this moment RaPID only takes VCF format input. But I guess you can find third-party scripts to convert between these formats.

Thanks,

Degui

On May 8, 2019, at 3:46 PM, BOUZIN Yoan notifications@github.com wrote:

Hi I would like if this tool can be applied to find IBD in raw data genetic test seeling by DNA company such as 23andme , AncestryDNA, FamilyTreeDNA, LivingDNA, MyHeritage, the file are like table with snp id, chromosome, position, genotype, if correctly converted into the accepted input format (like VCF or BED etc...) Thanks for our repply !

— You are receiving this because you commented. Reply to this email directly, view it on GitHub < https://github.com/ZhiGroup/RaPID/issues/2#issuecomment-490644783>, or mute the thread < https://github.com/notifications/unsubscribe-auth/ACLG2EUSSIEZT7WEPGDXSM3PUM33BANCNFSM4DCDANVQ .

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ZhiGroup/RaPID/issues/2?email_source=notifications&email_token=ACPNJJYN3VUZATZH5Y45C7LPYWDF5A5CNFSM4DCDANV2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW2XFPI#issuecomment-498430653, or mute the thread https://github.com/notifications/unsubscribe-auth/ACPNJJ72XZOIWX62RYTYKH3PYWDF5ANCNFSM4DCDANVQ .

-- BOUZIN Yoan - Ingénieur d'Etude Bioinformaticien Institut de Génétique Humaine - Groupe GNOSTIC (Genotoxic stress bodies in cancer physiopathology) 15 Avenue Charles Flahault, 34090 Montpellier (IMGT Lab) [image: Institut de Génétique Humaine (IGH)] — You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ZhiGroup/RaPID/issues/2?email_source=notifications&email_token=ACLG2EQ2HPZ322F2WHFFKC3PYWKKVA5CNFSM4DCDANV2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODW23LFI#issuecomment-498447765, or mute the thread https://github.com/notifications/unsubscribe-auth/ACLG2ES3CCXSUK52TPZYJ2LPYWKKVANCNFSM4DCDANVQ.

YingZhou001 commented 5 years ago

Hi Degui,

I am wondering which version can support vcf format? The version v.1.6 can not work because of the license problem.

Thanks,

Ying

zhizhid commented 5 years ago

Hi Ying,

We are working on the next version. Hopefully will be available in a month or two.

Degui

On Aug 16, 2019, at 11:07 AM, YingZhou001 notifications@github.com wrote:

Hi Degui,

I am wondering which version can support vcf format? The version v.1.6 can not work because of the license problem.

Thanks,

Ying

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/ZhiGroup/RaPID/issues/2?email_source=notifications&email_token=ACLG2ERAXG3PEGMK23QKFYTQE3GDNA5CNFSM4DCDANV2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD4PA2TA#issuecomment-522063180, or mute the thread https://github.com/notifications/unsubscribe-auth/ACLG2ERSIBSQUHUKJDJVMFLQE3GDNANCNFSM4DCDANVQ.

YingZhou001 commented 5 years ago

Hi Degui,

I am trying to follow your analysis in your GB paper to use RaPID v.1.2.3 to detect IBD segments from UK Biobank data. I wrote a pipline with your converting program "convert_vcf_to_macs.py" so that I can convert vcf file into MaCS format. By applying my pipline to your data "4k_1e8_e0.0025.vcf.gz" (I think it is better to rename as "4k_1e7_e0.0025.vcf.gz" since the length is 10Mb), I confirmed it can be used to generate input for RaPID. Then I turned to the UK Biobank data. I sampled first 100 columns of the vcf files for each chromosomes and run my pipline and RaPID on them. It is interesting that RaPID reported errors only on chr9, chr21, and chr22. Here is my script to run on chr22:

RaPID_v.1.2.3 -i chr22.tmp.100.macs.input -o test.22.5cM -r 10 -w 3 -s 2 -l 5000000 -t 51224208 -g chr22.tmp.100.macs.g

       Create sub-samples..
       double free or corruption (out)
       Abort

Could you tell me what should I do next to fix this problem? If possible, could you share your pipeline used for analysis in your GB paper so that it will be easier for me to learn how to run RaPID from it?

By the way, I also notice the notations conflict among your paper, examples, and the RaPID program. For example, "-l" in your program indicates "Minimum IBD length in cM", and it should be in bp in your example, but it has a different presentation as "-L" in your paper. I also feel very confused about setting the length threshold, you have "-l" and "-d" at the same time. Is it really necessary to customize two thresholds?

Best,

Ying

ardalan-n commented 5 years ago

Hi Ying,

Please make sure that the created MaCS files do not contain any missing values. The genetic mapping file should also contain the genetic locations for all the sites in the MaCS format (in the same order). I think some sites may not be SNPs (e.g. INDEL) but bi-allelic INDELs, and the script to convert MaCS into VCF might ignore those sites. Those sites can be ignored but the corresponding mapping file should match the MaCS file.

The parameter -d actually was added to tackle the uneven genetic mapping. The assumption of 1 Mbps ~ 1 cM may not hold for some regions across the chromosomes. To tackle this issue, we provided the option to set two cut-offs. -d determines the length in cM and -l in bps. For example, if you search for 5 cM, then you can set -d to 5 (cM) and -l is the minimum length in bps. I recommend using the minimum distance in bps for -l that covers most of the 5 cM distances in real data where there's an uneven genetic mapping (e.g. 90% of the all 5 cM distances across the chromosome).

Best, Ardalan

YingZhou001 commented 5 years ago

Hi Ardalan,

Thanks for your reply. According to your suggests, I use vcftools to remove INDELs from the data (--remove-indels --min-alleles 2 --max-alleles 2), and 11638 SNPs remained. The table s3 in your paper says you have 10911 SNPs for chr22. So do you extra steps for the quality control? Then I try my pipeline on this data and it failed again, but this time it said different:

RaPID_v.1.2.3 -i chr22.tmp.100.macs.input -o test.22.5cM -r 10 -w 3 -s 2 -d 5 -l 1239138 -t 51224208 -g chr22.tmp.100.macs.g Create sub-samples.. munmap_chunk(): invalid pointer

There is no missing in the data sets.

Best,

Ying

ardalan-n commented 5 years ago

Hi Ying,

I assume some sites are INDEL but have been labeled as SNP because they were just bi-allelic, e.g. when the reference allele is AG and the alternative allele is G. As a result, VCFtools may not filter them out. However, those sites should not cause the error you mentioned. I think that there might be some discrepancies between the provided MaCS file and the genetic mapping file. I have attached a script to interpolate the values for the sites in a VCF file given a genetic mapping file.

interpolate_loci.zip

Best, Ardalan

YingZhou001 commented 5 years ago

Hi Ardalan,

Thanks for your reply, can you say something about the input format of interpolate_loci.py? I don't use python to program.

Best,

Joe

ardalan-n commented 5 years ago

The input files are the mapping file and the VCF file. The mapping file should contain the genomic location in the second column and the genetic location in the last column (space or tab-delimited).

Usage: python interpolate_loci.py

Best, Ardalan

YingZhou001 commented 5 years ago

Thanks Ardalan, Now I can use your script to generate map input for RaPID, with reference Hapmap LD map.

And here are the first 5 and last 5 lines:

0 0.04354310426330903 1 1.0386992039668905 2 1.0387688597532405 3 1.4087540288873843 4 1.4140660491921375 .... 11633 74.07951522370122 11634 74.0811845894881 11635 74.08619397444285 11636 74.10185406282723 11637 74.1058950938716

But the program failed again, saying different as:

RaPID_v.1.2.3 -i chr22.tmp.100.macs.input -o test.22.5cM -r 10 -w 3 -s 2 -d 5 -l 1239138 -t 51224208 -g out.22.g Create sub-samples.. free(): invalid size

here is my script to prepare the MaCS input from vcf file:

${vcftools} --gzvcf ${vcf} --remove-indels --min-alleles 2 --max-alleles 2 --recode --temp /scratch --stdout | gzip -c > tmp.chr${chr}.${da}.vcf.gz

length="$(zcat tmp.chr${chr}.${da}.vcf.gz | cut -f 2 | tail -n1)" hap="$(zcat tmp.chr${chr}.${da}.vcf.gz | head -n 100 | grep "#CHROM" | cut -f 10- | sed "s/\t/\n/g" | wc -l)" hap=$( expr 2 '*' "${hap}" )

python ${vcf2macs} tmp.chr${chr}.${da}.vcf.gz ${length} chr${chr}.${da}.macs.input ${hap}

Notes: ${vcf2macs} is your format converting python script.

I still can't find out what is wrong there.

Best,

Ying

liangqinyi commented 5 months ago

hello I originally analyzed the maize genome data using RaPID1.7 version, but after running the vcf compressed file and map file, I obtained dozens of files named macs.subsample_, including a max file. May I ask if this is normal? Can't Rapid be used to analyze data other than humans? Or can users adjust parameters to analyze data outside of humans?