Format of the file with call rate information

rashindrie commented 3 years ago

Hi,

Im trying to use mocha on my genotype array data. After creating the minimal BCF, I tried the following command

awk -F"\t" '$2<.97 {print $1}' $crt > samples_xcl_list.txt; \
echo '##INFO=<ID=JK,Number=1,Type=Float,Description="Jukes Cantor">' | \
  bcftools annotate --no-version -Ou -a $dup -c CHROM,FROM,TO,JK -h /dev/stdin $dir/$pfx.unphased.bcf | \
  bcftools view --no-version -Ou -S ^samples_xcl_list.txt | \
  bcftools +fill-tags --no-version -Ou -t ^Y,MT,chrY,chrM -- -t ExcHet,F_MISSING | \
  bcftools view --no-version -Ou -G | \
  bcftools annotate --no-version -Ob -o $dir/$pfx.xcl.bcf \
    -i 'FILTER!="." && FILTER!="PASS" || INFO/JK<.02 || INFO/ExcHet<1e-6 || INFO/F_MISSING>1-.97' \
    -x ^INFO/JK,^INFO/ExcHet,^INFO/F_MISSING && \
  bcftools index -f $dir/$pfx.xcl.bcf;
/bin/rm samples_xcl_list.txt

This gave me the below error:

Could not parse JK at chr1:632287 .. [-]
Error: exclude called for sample that does not exist in header: "Sample ID,Call Rate". Use "--force-samples" to ignore this error.
Failed to read from standard input: unknown file type
Failed to read from standard input: unknown file type
Failed to read from standard input: unknown file type

The format of the file I currently use for $crt

Sample ID,Call Rate
06-056,0.9990433
06-061,0.9992602
06-044,0.9994463
06-011,0.9994555
06-012,0.9993386
.....

I wasn't sure how the $crt file should look like, so I created a CSV in the following format. I was wondering if there is a set specification for this file?

Thanks, Rashindrie

freeseek commented 3 years ago

@Rashindrie you are right, I forgot to explain how to format the file. The $crt file needs to be in tab delimeted format and without the header. So you can use the following command:

tail -n+2 $crt | tr ',' '\t' | sponge $crt

And it should make the file compatible with the awk command to generate the samples_xcl_list.txt file.

rashindrie commented 3 years ago

Thanks @freeseek!

Another question, I believe the below line is trying to copy the CHROM,FROM,TO,JK columns from $dup file. However, my dup file does not contain any specific columns.

Command:

dup="$HOME/GRCh38/segdups.bed.gz"
bcftools annotate --no-version -Ou -a $dup -c CHROM,FROM,TO,JK -h /dev/stdin $dir/$pfx.unphased.bcf |

Contents in segdups.bed.gz

less $HOME/GRCh38/segdups.bed.gz

chr1    10000   10485   -       0.00711937
chr1    10485   18392   +       0.00579252
chr1    18392   87112   -       0.00457824
chr1    87113   88000   +       0.00686326
chr1    88000   110582  -       0.00653207

Because of this I am getting the below error:

Could not parse JK at chr1:632287 .. [-]

Command used to create segdups.bed.gz.

wget -O- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/genomicSuperDups.txt.gz | gzip -d |
  awk '!($2=="chrX" && $8=="chrY" || $2=="chrY" && $8=="chrX") {print $2"\t"$3"\t"$4"\t"$30}' > genomicSuperDups.bed

awk '{print $1,$2; print $1,$3}' genomicSuperDups.bed | \
  sort -k1,1 -k2,2n | uniq | \
  awk 'chrom==$1 {print chrom"\t"pos"\t"$2} {chrom=$1; pos=$2}' | \
  bedtools intersect -a genomicSuperDups.bed -b - | \
  bedtools sort | \
  bedtools groupby -c 4 -o min | \
  awk 'BEGIN {i=0; s[0]="+"; s[1]="-"} {if ($4!=x) i=(i+1)%2; x=$4; print $0"\t0\t"s[i]}' | \
  bedtools merge -s -c 4 -o distinct | \
  grep -v "GL\|KI" | bgzip > $HOME/GRCh38/segdups.bed.gz && \
  tabix -f -p bed $HOME/GRCh38/segdups.bed.gz

I am not sure why JK is not present in the file. Would you be able to provide some guidance.

Thank you

freeseek commented 3 years ago

Make sure bedtools version is 2.27 or newer, or the segdups.bed.gz file will be malformed.

rashindrie commented 3 years ago

Looks like I was using bedtools 2.26.0. I'll try updating.

Thank you!

freeseek / mocha

Format of the file with call rate information #12