danimfernandes / tkgwv2

An ancient DNA relatedness pipeline for ultra-low coverage whole genome shotgun data
GNU General Public License v2.0
6 stars 2 forks source link

Issue running plink2tkrelated #3

Open roberta-davidson opened 2 years ago

roberta-davidson commented 2 years ago

Hi,

I'm having this recurring issue when running bam2plink. Sometimes the script will run for some pairs and then have this error at a certain pair and sometimes it will fail with this error immediately. I have checked that all SNPs are bi-allelic and there are no duplicates, as suggested in the previous issue. I cannot work out what may be causing this.

Error in names(x) <- value : 
  'names' attribute [11] must be the same length as the vector [9]
Calls: colnames<-
Execution halted
danimfernandes commented 2 years ago

Hi Roberta!

We'll sort this out! Are you for example using samples with too low amounts of data? In any case, can you give me more details, for example at which stage this is failing and which files are the last to be produced? Thanks!

danimfernandes commented 2 years ago

Actually, can you confirm that this error happens during bam2plink, not during plink2tkrelated? In the latter, when loading frequency files, 11 columns are expected, but it seems like in your case it only had 9 (likely missing the 2 allele frequency columns, possibly because of no overlapping data for that pair?).

roberta-davidson commented 2 years ago

Hi,

The samples are enriched with a decent amount of data - in any case I have gotten the scripts to work for other samples that had far less data, so I don't think that's the issue. Sorry - I just realised my mistake - the issue is with plink2related. here is the full error from TKGWV2:

 ################################################################################
 ### TKGWV2 - An ancient DNA relatedness pipeline for ultra-low coverage data ###
 ## Version 1.0a - Released 06/2021
 #
 # [2022-07-01 10:45:24] Running 'bam2plink' on folder /path/TKGWV2/SAmerge4_tkgwv2
     # BAM >> Pileup >> Text-PLINK 
     # Files to be processed:
         ALOW1.trimmed.bam
         ALOW2.trimmed.bam
         CY5_3.trimmed.bam
         CY5_4.trimmed.bam
         CYS_1.trimmed.bam
         ES1.trimmed.bam
         IPUN1.trimmed.bam
         IYA1.trimmed.bam
         IYA2.trimmed.bam
         IYA3.trimmed.bam
         IYA4.trimmed.bam
         IYA5.trimmed.bam
         IYA6.trimmed.bam
         MEL1-5.trimmed.bam
         MEL1-6.trimmed.bam
         MEL1.trimmed.bam
         MEL2.trimmed.bam
         MEL4.trimmed.bam
         NAH1_1.trimmed.bam
         VIC2_1.trimmed.bam
     # Arguments used:
         --referenceGenome = /path/GRCh37/human_g1k_v37_decoy.fasta
         --gwvList = ./SAmerge4.BED
         --gwvPlink = ./SAmerge4_maf
         --bamExtension = .trimmed.bam
 # [2022-07-01 10:54:29] All BAM files processed

 # [2022-07-01 10:54:30] Running 'plink2tkrelated' on folder /path/TKGWV2/SAmerge4_tkgwv2
     # Text-PLINK >> Pairwise transposed text-PLINK >> Relatedness estimates NULL
     # Arguments used:
         --freqFile = ./SAmerge4.frq
         --dyads = Chonos.dyads
     # Estimating coefficient of relatedness Rxy: NAH1_1 MEL1-5 (1/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 CYS_1 (2/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 IYA3 (3/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 MEL2 (4/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 IYA2 (5/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 ALOW1 (6/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 CY5_4 (7/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 IYA6 (8/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 MEL1 (9/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 ALOW2 (10/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 MEL1-6 (11/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 ES1 (12/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 IYA5 (13/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 VIC2_1 (14/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 IPUN1 (15/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 IYA4 (16/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 MEL4 (17/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 IYA1 (18/380)
     # Estimating coefficient of relatedness Rxy: NAH1_1 CY5_3 (19/380)
     # Estimating coefficient of relatedness Rxy: MEL1-5 NAH1_1 (20/380)
Error in names(x) <- value : 
  'names' attribute [11] must be the same length as the vector [9]
Calls: colnames<-
Execution halted

As you can see the pair that triggered the error is the same as the first pair that successfully ran (just because of how I write out my dyads file, I figured no harm in doubling up) I did try removing the pair from the dyads file at one point in case the duplication was the issue, but the same error occured.

danimfernandes commented 2 years ago

Can you check if any of the following files is empty, if they are still visible?

"commMEL1-5_NAH1_1_SNPs" "commMEL1-5_NAH1_1_SNPs2" "commMEL1-5_NAH1_1_SNPs3" "MEL1-5__NAH1_1.tped" "MEL1-5__NAH1_1.freq"

As well as the same files but for your pair #1 (this pair in reverse). It is curious that it is happening on a reversed pair, but I don't see why it wouldn't work in only one of the ways.

roberta-davidson commented 2 years ago

I have no files with the *SNPs suffix. I also don't have "MEL1-5__NAH1_1.freq" I do have "commMEL1-5__NAH1_1.frq" if that's what you mean?

head -5 MEL1-5____NAH1_1.tped :

1 1_2020343 0.020203 2020343 T T T T
1 1_3456871 0.034569 3456871 G G A A
1 1_9808302 0.098083 9808302 G G A A
1 1_15613319 0.156133 15613319 C C C C
1 1_20576256 0.205763 20576256 A A A A

head -5 NAH1_1____MEL1-5.tped :

1 1_2020343 0.020203 2020343 T T T T
1 1_3456871 0.034569 3456871 G G A A
1 1_9808302 0.098083 9808302 G G A A
1 1_15613319 0.156133 15613319 C C C C
1 1_20576256 0.205763 20576256 A A A A
danimfernandes commented 2 years ago

Yes, I meant the "comm(...).frq", sorry. If the "SNPs" file is not empty then a loop starts and the "tped" and "frq" are eventually created, so could you also share the head -5 for the corresponding .frq files? The error seems to be related to 2 missing columns in one of the files, and without looking deeper at the files it seems like the allele frequencies are missing. Ultimately, if you are able to share the PED+MAP files for those 2 samples I can try to replicate and debug the error.

roberta-davidson commented 2 years ago

head -5 commMEL1-5____NAH1_1.frq:

   1     1_2020343    C    T       0.3486      370
   1     1_3456871    A    G       0.2404      366
   1     1_9808302    G    A       0.3659      358
   1    1_15613319    T    C     0.005051      396
   1    1_18033151    G    A      0.01852      324

head -5 commNAH1_1____MEL1-5.frq:

   1     1_2020343    C    T       0.3486      370
   1     1_3456871    A    G       0.2404      366
   1     1_9808302    G    A       0.3659      358
   1    1_15613319    T    C     0.005051      396
   1    1_18033151    G    A      0.01852      324

Ped & Map files for those two samples in this box folder: https://universityofadelaide.box.com/s/m81z7eeizpsvpxahqo16rwtc5ypew6hz

danimfernandes commented 2 years ago

Hi Roberta, can you share the population frequencies file as well please, so I can run the whole plink2tkrelated section? Those "frq" files do look fine on a first look.

roberta-davidson commented 2 years ago

Sorry for that, the frq file should be downloaded into the box folder in a few minutes

danimfernandes commented 2 years ago

It seems to run with no issues for me, sorry about that.

Can you try to download the most up-to-date version of the code? Seems like you are using an older version, as for example it now shows the HRC for every test. I ran these 2 sample with and without the --dyads command, and no issues.

One thing that I did notice, is that SNP 1_18033151 that you have on the "frq" files up here in this thread is not in the files you sent me. That's fine if you sent me a subset of the data, but in that case the issue can be related to duplicated SNP names, or strange symbols in SNP names. But for now I would suggest you try to download the software again, check if the "tped" and "frq" have the same number of lines, and that you don't have duplicated SNP names. I do remember that the very first uploaded version of TKGWV2 had a bug on getting partial instead of full SNP name matches, but that has been fixed for a while.

If none of this works, we'll see what to do next :)

danimfernandes commented 2 years ago

Actually, upon looking at your output again, the issue is not with MEL1-5 NAH1_1, but rather with the next pair (21/300). So have a look at those individuals instead! Sorry I didn't notice this earlier.

roberta-davidson commented 2 years ago

Ok! I may have mixed up exactly which files I shared which would explain the differences, though I have already checked for duplicate SNPs so that should be fine. I'll have a go with updating the code tomorrow and see if that helps!

danimfernandes commented 2 years ago

Great! Otherwise check file lengths of the failing pair (21/300) and see if they don't match somehow.

roberta-davidson commented 2 years ago

Okay, having updated the code, I now get this error if I run without the dyads option:


 # [2022-07-12 16:13:10] Running 'plink2tkrelated' on folder /hpcfs/users/a1717363/05_Chonos/TKGWV2/SouthCone2_tkgwv2
     # Text-PLINK >> Pairwise transposed text-PLINK >> Relatedness estimates NULL
     # Arguments used:
         --freqFile = ./SouthCone2.frq
Error in combn(list_samples, 2) : n < m
Calls: data.frame -> t -> combn
Execution halted

And this if I run with the dyads option:

 # [2022-07-12 16:26:57] Running 'plink2tkrelated' on folder /hpcfs/users/a1717363/05_Chonos/TKGWV2/SouthCone2_tkgwv2
     # Text-PLINK >> Pairwise transposed text-PLINK >> Relatedness estimates NULL
     # Arguments used:
         --freqFile = ./SouthCone2.frq
         --dyads = Chonos.dyads
awk: fatal: cannot open file `NAH1_1____MEL1-5.tped' for reading (No such file or directory)
wc: NAH1_1____MEL1-5.tped: No such file or directory
Error in strsplit(system(comm10b, intern = T), " ")[[1]] : 
  subscript out of bounds
In addition: Warning message:
In system(comm10b, intern = T) :
  running command 'wc -l NAH1_1____MEL1-5.tped' had status 1
Execution halted

I am now more confused..?

danimfernandes commented 2 years ago

Hi Roberta. The first error suggests that there is only 1 .ped file in the folder you are working on. The second one mentions that it can't find that specific .tped, so perhaps because of the same issue as in the first error you shared? If you want me to have a look again, feel free to share files with me. If can also email me a link instead of dropping it here, if you'd prefer. Email is dani.mag.fernandes {at} gmail.com

danimfernandes commented 2 years ago

Also, did you end up updating the software? Even with just one .ped in the folder, you should be able to see a list of samples (.ped files) detected.

danimfernandes commented 2 years ago

Hi again! I updated the "plink2tkrelated.R" script to include many more sanity checks, and also added a verbose mode (--verbose, -v) that will print out some information as the dyad processing is happening. Both should hopefully help understand where and why errors are happening.

I strongly suggest you download this new version and give it a try with verbose mode in plink2tkrelated. Hope it helps!