e-jorsboe / fastNGSadmix

Program for estimating admixture proportions and doing principal component analysis of a single NGS sample
GNU General Public License v3.0
9 stars 4 forks source link

Exome? #5

Closed sabrinacamp2 closed 4 years ago

sabrinacamp2 commented 4 years ago

Hi!

I was trying to follow the Quick start - 1000 genomes reference panel tutorial using a BAM derived from whole exome sequencing. The output of the ANGSD step to get the beagle formatted file gives me 0 variants. This is what it says in the log:

-> Total number of sites analyzed: 709988831 -> Number of sites retained after filtering: 0

Does fastNGSadmix not work with WES? If so, do you happen to recommend any other tool to derive admixture proportions using a WES BAM file?

All the best, Sabrina

e-jorsboe commented 4 years ago

Hi Sabrina,

It seems like there is an issue with your ANGSD command. Could you post it here?

The only filters in the ANGSD command I give for the quick start are these: -sites $SITES -minMapQ 30 -minQ 20

So perhaps check how the -sites file relates to your data, are they both aligned to the same reference genome? (1000 genomes is hg19).

And what is WES?

Best, Emil

sabrinacamp2 commented 4 years ago

Hi Emil,

I used this ANGSD command

angsd -i sample.bam -GL 2 \ -sites sites -doGlf 2 \ -doMajorMinor 3 -minMapQ 30 -minQ 20 \ -doDepth 1 -doCounts 1 \ -out ${sample_id}

My BAM is also aligned to hg19. By WES, I mean whole-exome sequencing. I was curious if the tool only worked/ provided accurate results for a whole-genome sequenced bam or if whole-exome worked as well.

sabrinacamp2 commented 4 years ago

Hi Emil,

I have one other question. Unlike NGSadmix, fastNGSadmix does not take a K value as an input. Does the K value equal the number of populations you define in the -whichPops argument?

e-jorsboe commented 4 years ago

Hi,

Ok, I still think there is some mismatch between the BAM file and the .sites file.

Can you try and run this command in ANGSD: angsd -i sample.bam -GL 2 -doGlf 2 -doMajorMinor 1 -minMapQ 30 -minQ 20 -doDepth 1 -doCounts 1 -out out -doMaf 3 -rf chr1:1-1000000 (-rf only analyses variants in this window, so feel free to change it to whenever your first variants are)

And give me the head of the .mafs.gz file?

And also give me the head of the .sites file?

I think it might be that in one file the chrs are coded "chr1" and in the other just "1".

fastNGSadmix should also work for exome data, especially whole-exome data!

################

And fastNGSadmix does not take a K parameter since it is more supervised than that, we tell it to look for certain populations. But yeah I guess you can see the number of populations included, as something similar to K.

Best, Emil

sabrinacamp2 commented 4 years ago

Thank you so much for your help so far. I tried running the command you suggest and I got this error:

[getFilenames] -> Problems opening file

I checked the BAM header and my contigs are 1..22,X. I also checked the sites file and it is the same formatting. I am a little confused on which file would be .mafs.gz. The ending of my sequencing file is .bam and usually .maf is a variant call format.

Thanks for the insight on the K parameter!

e-jorsboe commented 4 years ago

Hi,

It seems like you have specified the wrong filename (I just guessed of the name of the .bam file, so please use the name of your .bam file). Or that there is something wrong with your .bam file.

The mafs.gz file will be generated by ANGSD when the aforementioned command runs successfully.

e-jorsboe commented 4 years ago

Hi Sabrina, did you resolve your problems? Or is this still an active issue?

sabrinacamp2 commented 4 years ago

Hi Emil,

I decided to just bypass the ANGSD step by getting genotype likelihoods via bcftools -> using plink to turn the vcf into the binary plink files -> using these outputs directly in fastNGSadmix.

I will close the issue since I am no longer trying to troubleshoot the issue.

Thank you for all of your help! Best, Sabrina