COMBINE-lab / maximum-likelihood-relatedness-estimation

22 stars 3 forks source link

SNPbam2vcf.py, line 247, KeyError: 'RG' #15

Open mcetincompbio opened 6 years ago

mcetincompbio commented 6 years ago

I am running the following command: python SNPbam2vcf.py ancAnatBam.list BonBarTep.vcf 1000G_D100KApart.list

ancAnatBam.list contains the paths of a list of Bam files. 1000G_D100KApart.list is the snp list in the format requested.

I get this error:

Traceback (most recent call last): File "SNPbam2vcf.py", line 247, in sample_name.append(samfiles[g].header['RG'][0]['SM']) KeyError: 'RG'

Please respond.

kveeramah commented 6 years ago

Do you have readgroups (RGs) added to each of your bam files? You can do this using the PICARD AddOrReplaceReadGroups function

https://broadinstitute.github.io/picard/command-line-overview.html

mcetincompbio commented 6 years ago

Thanks!

Now I am getting a different error.

Traceback (most recent call last): File "SNPbam2vcf.py", line 246, in samfiles[g] = pysam.AlignmentFile(samples[g], "rb") File "pysam/libcalignmentfile.pyx", line 401, in pysam.libcalignmentfile.AlignmentFile.cinit (pysam/libcalignmentfile.c:5835) File "pysam/libcalignmentfile.pyx", line 564, in pysam.libcalignmentfile.AlignmentFile._open (pysam/libcalignmentfile.c:7578) IOError: file `` not found

kveeramah commented 6 years ago

Is there an index file for each corresponding bam file present?

mcetincompbio commented 6 years ago

Yes. The index files are present in the same folder as the bams. bam.list contains the paths of the bams only.

mcetincompbio commented 5 years ago

Any idea what might be happening?

kveeramah commented 5 years ago

It’s hard to parse from the error message alone.

Could you send us one of your bam files to play with (or a portion of them that includes the header and a subset of reads)?

-- Krishna R Veeramah, Ph.D. Assistant Professor Dept of Ecology and Evolution Stony Brook University Rm 616, 650 Life Sciences Building Stony Brook NY 11794-5245 Office: 631-632-1101 Cell: 510-207-1424 E-mail: krishna.veeramah@stonybrook.edu http://life.bio.sunysb.edu/ee/veeramahlab

On Oct 10, 2018, at 2:34 AM, mcetincompbio notifications@github.com wrote:

Any idea what might be happening?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mcetincompbio commented 5 years ago

Here is one of the bams: ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12877_S1.bam

And another, if you need two: ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12890_S1.bam

Do you need the SNP list file as well?

kveeramah commented 5 years ago

Yes, please send me the snp list.

-- Krishna R Veeramah, Ph.D. Assistant Professor Dept of Ecology and Evolution Stony Brook University Rm 616, 650 Life Sciences Building Stony Brook NY 11794-5245 Office: 631-632-1101 Cell: 510-207-1424 E-mail: krishna.veeramah@stonybrook.edu http://life.bio.sunysb.edu/ee/veeramahlab

On Oct 10, 2018, at 9:39 AM, mcetincompbio notifications@github.com wrote:

Here is one of the bams: ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12877_S1.bam

And another, if you need two: ftp://ftp.sra.ebi.ac.uk/vol1/ERA172/ERA172924/bam/NA12890_S1.bam

Do you need the SNP list file as well?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mcetincompbio commented 5 years ago

Two list files are here:

https://www.justbeamit.com/y9x72

Let me know if there is a problem.

Thanks!!

kveeramah commented 5 years ago

Apparently this download link "no longer exists".

On Fri, Oct 12, 2018 at 4:32 AM mcetincompbio notifications@github.com wrote:

Two list files are here:

https://www.justbeamit.com/y9x72

Let me know if there is a problem.

Thanks!!

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/COMBINE-lab/maximum-likelihood-relatedness-estimation/issues/15#issuecomment-429248542, or mute the thread https://github.com/notifications/unsubscribe-auth/AM8_k74uz2QIJLkcde7Ghg2RDvY-JJaMks5ukFOfgaJpZM4WvtuR .

-- Krishna R Veeramah, Ph.D. Assistant Professor Dept of Ecology and Evolution Stony Brook University Rm 616, 650 Life Sciences Building Stony Brook NY 11794-5245 Office: 631-632-1101 Cell: 510-207-1424 E-mail: krishna.veeramah@stonybrook.edu http://life.bio.sunysb.edu/ee/veeramahlab

mcetincompbio commented 5 years ago

Ooops!

http://www.rapidshare.com.cn/ib6HAho

kveeramah commented 5 years ago

Can you send me your file with the list of bams? Do you have an extra carriage return, the script seems to be looking for a file '' that is not there.

mcetincompbio commented 5 years ago

That was it!

Thanks a lot. I feel dumb but these things happen...

Now I run the program with the published CEPH family data (whose relatedness is known), and get the following result:

Ind1 Ind2 k0_hat k1_hat k2_hat pi_HAT nbSNP NA12883 NA12890 1.000 0.000 0.000 0.000 1495 NA12877 NA12889 0.999 0.000 0.001 0.001 1496 NA12877 NA12885 0.999 0.001 0.000 0.000 1497 NA12885 NA12889 0.999 0.001 0.000 0.000 1496 NA12883 NA12889 0.999 0.001 0.000 0.000 1493 NA12885 NA12890 1.000 0.000 0.000 0.000 1497 NA12889 NA12890 1.000 0.000 0.000 0.000 1495 NA12877 NA12883 0.999 0.001 0.000 0.000 1494 NA12883 NA12885 0.881 0.002 0.118 0.119 1496 NA12877 NA12890 0.989 0.010 0.000 0.006 1497

NA12877 and NA12889 are first degree relatives. But we get a k0_hat of 0.999 which, if I'm not mistaken, means that no relatedness is found.

What could be the reason for this?

I am downloading the bams from the links I gave earlier.

kveeramah commented 5 years ago

How were your SNPs chosen?

-- Krishna R Veeramah, Ph.D. Assistant Professor Dept of Ecology and Evolution Stony Brook University Rm 616, 650 Life Sciences Building Stony Brook NY 11794-5245 Office: 631-632-1101 Cell: 510-207-1424 E-mail: krishna.veeramah@stonybrook.edu http://life.bio.sunysb.edu/ee/veeramahlab

On Oct 16, 2018, at 9:52 AM, mcetincompbio notifications@github.com wrote:

NA12877 and NA12889

mcetincompbio commented 5 years ago

I took all snps in the 1000 genomes dataset and thinned it so that the snps are 100kb apart.

kveeramah commented 5 years ago

Was there a criteria to the thinning (MAF in CEU?). If the SNPs are not variable or vary rare in your pedigree then there will be no signal.

-- Krishna R Veeramah, Ph.D. Assistant Professor Dept of Ecology and Evolution Stony Brook University Rm 616, 650 Life Sciences Building Stony Brook NY 11794-5245 Office: 631-632-1101 Cell: 510-207-1424 E-mail: krishna.veeramah@stonybrook.edu http://life.bio.sunysb.edu/ee/veeramahlab

On Oct 16, 2018, at 10:09 AM, mcetincompbio notifications@github.com wrote:

I took all snps in the 1000 genomes dataset and thinned it so that the snps are 100kb apart.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

kveeramah commented 5 years ago

Yes. You should preferably choose SNPs where the AF is >10% in CEU individuals.

-- Krishna R Veeramah, Ph.D. Assistant Professor Dept of Ecology and Evolution Stony Brook University Rm 616, 650 Life Sciences Building Stony Brook NY 11794-5245 Office: 631-632-1101 Cell: 510-207-1424 E-mail: krishna.veeramah@stonybrook.edu http://life.bio.sunysb.edu/ee/veeramahlab

On Oct 17, 2018, at 3:57 AM, mcetincompbio notifications@github.com wrote:

It seems like a great majority (possibly more than 95%) of the SNPS are 0/0 in all 5 individuals of the CEPH family.

Maybe that's why.

Should I try with a different set of snps?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mcetincompbio commented 5 years ago

Really appreciate your help!

I did that, and got the following result:

Ind1 Ind2 k0_hat k1_hat k2_hat pi_HAT nbSNP NA12885 NA12889 1.000 0.000 0.000 0.000 17665 NA12883 NA12890 1.000 0.000 0.000 0.000 17655 NA12889 NA12890 1.000 0.000 0.000 0.000 17656 NA12885 NA12890 1.000 0.000 0.000 0.000 17653 NA12877 NA12890 0.998 0.001 0.001 0.001 17661 NA12883 NA12889 0.999 0.001 0.000 0.000 17663 NA12877 NA12885 0.999 0.001 0.000 0.000 17667 NA12877 NA12883 0.999 0.001 0.000 0.000 17664 NA12883 NA12885 0.854 0.002 0.144 0.145 17660 NA12877 NA12889 0.998 0.002 0.000 0.001 17668

kveeramah commented 5 years ago

Do you only have members of the CEU pedigree in this dataset, or are there unrelated samples as well. Perhaps you could send me your final vcf so I could look at it? We have analyzed this pedigree before and accurately determined their relatedness (in our preprint).

On Fri, Oct 19, 2018 at 8:27 AM mcetincompbio notifications@github.com wrote:

Really appreciate your help!

I did that, and got the following result:

Ind1 Ind2 k0_hat k1_hat k2_hat pi_HAT nbSNP NA12885 NA12889 1.000 0.000 0.000 0.000 17665 NA12883 NA12890 1.000 0.000 0.000 0.000 17655 NA12889 NA12890 1.000 0.000 0.000 0.000 17656 NA12885 NA12890 1.000 0.000 0.000 0.000 17653 NA12877 NA12890 0.998 0.001 0.001 0.001 17661 NA12883 NA12889 0.999 0.001 0.000 0.000 17663 NA12877 NA12885 0.999 0.001 0.000 0.000 17667 NA12877 NA12883 0.999 0.001 0.000 0.000 17664 NA12883 NA12885 0.854 0.002 0.144 0.145 17660 NA12877 NA12889 0.998 0.002 0.000 0.001 17668

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/COMBINE-lab/maximum-likelihood-relatedness-estimation/issues/15#issuecomment-431346067, or mute the thread https://github.com/notifications/unsubscribe-auth/AM8_k6nYpOl6z29Qux0q9uULU5gF3es0ks5umcUegaJpZM4WvtuR .

-- Krishna R Veeramah, Ph.D. Assistant Professor Dept of Ecology and Evolution Stony Brook University Rm 616, 650 Life Sciences Building Stony Brook NY 11794-5245 Office: 631-632-1101 Cell: 510-207-1424 E-mail: krishna.veeramah@stonybrook.edu http://life.bio.sunysb.edu/ee/veeramahlab

mcetincompbio commented 5 years ago

I did read your preprint, and that makes me think I am doing something wrong.

Here is the final vcf. It only has the 5 members of the CEPH pedigree. http://www.rapidshare.com.cn/J8cKZph

Many thanks!

mcetincompbio commented 5 years ago

Hi,

Were you able to look at the file?