Open roberta-davidson opened 2 years ago
Hi Roberta!
We'll sort this out! Are you for example using samples with too low amounts of data? In any case, can you give me more details, for example at which stage this is failing and which files are the last to be produced? Thanks!
Actually, can you confirm that this error happens during bam2plink, not during plink2tkrelated? In the latter, when loading frequency files, 11 columns are expected, but it seems like in your case it only had 9 (likely missing the 2 allele frequency columns, possibly because of no overlapping data for that pair?).
Hi,
The samples are enriched with a decent amount of data - in any case I have gotten the scripts to work for other samples that had far less data, so I don't think that's the issue. Sorry - I just realised my mistake - the issue is with plink2related. here is the full error from TKGWV2:
################################################################################
### TKGWV2 - An ancient DNA relatedness pipeline for ultra-low coverage data ###
## Version 1.0a - Released 06/2021
#
# [2022-07-01 10:45:24] Running 'bam2plink' on folder /path/TKGWV2/SAmerge4_tkgwv2
# BAM >> Pileup >> Text-PLINK
# Files to be processed:
ALOW1.trimmed.bam
ALOW2.trimmed.bam
CY5_3.trimmed.bam
CY5_4.trimmed.bam
CYS_1.trimmed.bam
ES1.trimmed.bam
IPUN1.trimmed.bam
IYA1.trimmed.bam
IYA2.trimmed.bam
IYA3.trimmed.bam
IYA4.trimmed.bam
IYA5.trimmed.bam
IYA6.trimmed.bam
MEL1-5.trimmed.bam
MEL1-6.trimmed.bam
MEL1.trimmed.bam
MEL2.trimmed.bam
MEL4.trimmed.bam
NAH1_1.trimmed.bam
VIC2_1.trimmed.bam
# Arguments used:
--referenceGenome = /path/GRCh37/human_g1k_v37_decoy.fasta
--gwvList = ./SAmerge4.BED
--gwvPlink = ./SAmerge4_maf
--bamExtension = .trimmed.bam
# [2022-07-01 10:54:29] All BAM files processed
# [2022-07-01 10:54:30] Running 'plink2tkrelated' on folder /path/TKGWV2/SAmerge4_tkgwv2
# Text-PLINK >> Pairwise transposed text-PLINK >> Relatedness estimates NULL
# Arguments used:
--freqFile = ./SAmerge4.frq
--dyads = Chonos.dyads
# Estimating coefficient of relatedness Rxy: NAH1_1 MEL1-5 (1/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 CYS_1 (2/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 IYA3 (3/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 MEL2 (4/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 IYA2 (5/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 ALOW1 (6/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 CY5_4 (7/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 IYA6 (8/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 MEL1 (9/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 ALOW2 (10/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 MEL1-6 (11/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 ES1 (12/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 IYA5 (13/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 VIC2_1 (14/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 IPUN1 (15/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 IYA4 (16/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 MEL4 (17/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 IYA1 (18/380)
# Estimating coefficient of relatedness Rxy: NAH1_1 CY5_3 (19/380)
# Estimating coefficient of relatedness Rxy: MEL1-5 NAH1_1 (20/380)
Error in names(x) <- value :
'names' attribute [11] must be the same length as the vector [9]
Calls: colnames<-
Execution halted
As you can see the pair that triggered the error is the same as the first pair that successfully ran (just because of how I write out my dyads file, I figured no harm in doubling up) I did try removing the pair from the dyads file at one point in case the duplication was the issue, but the same error occured.
Can you check if any of the following files is empty, if they are still visible?
"commMEL1-5_NAH1_1_SNPs" "commMEL1-5_NAH1_1_SNPs2" "commMEL1-5_NAH1_1_SNPs3" "MEL1-5__NAH1_1.tped" "MEL1-5__NAH1_1.freq"
As well as the same files but for your pair #1 (this pair in reverse). It is curious that it is happening on a reversed pair, but I don't see why it wouldn't work in only one of the ways.
I have no files with the *SNPs suffix. I also don't have "MEL1-5__NAH1_1.freq" I do have "commMEL1-5__NAH1_1.frq" if that's what you mean?
head -5 MEL1-5____NAH1_1.tped
:
1 1_2020343 0.020203 2020343 T T T T
1 1_3456871 0.034569 3456871 G G A A
1 1_9808302 0.098083 9808302 G G A A
1 1_15613319 0.156133 15613319 C C C C
1 1_20576256 0.205763 20576256 A A A A
head -5 NAH1_1____MEL1-5.tped
:
1 1_2020343 0.020203 2020343 T T T T
1 1_3456871 0.034569 3456871 G G A A
1 1_9808302 0.098083 9808302 G G A A
1 1_15613319 0.156133 15613319 C C C C
1 1_20576256 0.205763 20576256 A A A A
Yes, I meant the "comm(...).frq", sorry. If the "SNPs" file is not empty then a loop starts and the "tped" and "frq" are eventually created, so could you also share the head -5 for the corresponding .frq files? The error seems to be related to 2 missing columns in one of the files, and without looking deeper at the files it seems like the allele frequencies are missing. Ultimately, if you are able to share the PED+MAP files for those 2 samples I can try to replicate and debug the error.
head -5 commMEL1-5____NAH1_1.frq
:
1 1_2020343 C T 0.3486 370
1 1_3456871 A G 0.2404 366
1 1_9808302 G A 0.3659 358
1 1_15613319 T C 0.005051 396
1 1_18033151 G A 0.01852 324
head -5 commNAH1_1____MEL1-5.frq
:
1 1_2020343 C T 0.3486 370
1 1_3456871 A G 0.2404 366
1 1_9808302 G A 0.3659 358
1 1_15613319 T C 0.005051 396
1 1_18033151 G A 0.01852 324
Ped & Map files for those two samples in this box folder: https://universityofadelaide.box.com/s/m81z7eeizpsvpxahqo16rwtc5ypew6hz
Hi Roberta, can you share the population frequencies file as well please, so I can run the whole plink2tkrelated section? Those "frq" files do look fine on a first look.
Sorry for that, the frq file should be downloaded into the box folder in a few minutes
It seems to run with no issues for me, sorry about that.
Can you try to download the most up-to-date version of the code? Seems like you are using an older version, as for example it now shows the HRC for every test. I ran these 2 sample with and without the --dyads command, and no issues.
One thing that I did notice, is that SNP 1_18033151 that you have on the "frq" files up here in this thread is not in the files you sent me. That's fine if you sent me a subset of the data, but in that case the issue can be related to duplicated SNP names, or strange symbols in SNP names. But for now I would suggest you try to download the software again, check if the "tped" and "frq" have the same number of lines, and that you don't have duplicated SNP names. I do remember that the very first uploaded version of TKGWV2 had a bug on getting partial instead of full SNP name matches, but that has been fixed for a while.
If none of this works, we'll see what to do next :)
Actually, upon looking at your output again, the issue is not with MEL1-5 NAH1_1, but rather with the next pair (21/300). So have a look at those individuals instead! Sorry I didn't notice this earlier.
Ok! I may have mixed up exactly which files I shared which would explain the differences, though I have already checked for duplicate SNPs so that should be fine. I'll have a go with updating the code tomorrow and see if that helps!
Great! Otherwise check file lengths of the failing pair (21/300) and see if they don't match somehow.
Okay, having updated the code, I now get this error if I run without the dyads option:
# [2022-07-12 16:13:10] Running 'plink2tkrelated' on folder /hpcfs/users/a1717363/05_Chonos/TKGWV2/SouthCone2_tkgwv2
# Text-PLINK >> Pairwise transposed text-PLINK >> Relatedness estimates NULL
# Arguments used:
--freqFile = ./SouthCone2.frq
Error in combn(list_samples, 2) : n < m
Calls: data.frame -> t -> combn
Execution halted
And this if I run with the dyads option:
# [2022-07-12 16:26:57] Running 'plink2tkrelated' on folder /hpcfs/users/a1717363/05_Chonos/TKGWV2/SouthCone2_tkgwv2
# Text-PLINK >> Pairwise transposed text-PLINK >> Relatedness estimates NULL
# Arguments used:
--freqFile = ./SouthCone2.frq
--dyads = Chonos.dyads
awk: fatal: cannot open file `NAH1_1____MEL1-5.tped' for reading (No such file or directory)
wc: NAH1_1____MEL1-5.tped: No such file or directory
Error in strsplit(system(comm10b, intern = T), " ")[[1]] :
subscript out of bounds
In addition: Warning message:
In system(comm10b, intern = T) :
running command 'wc -l NAH1_1____MEL1-5.tped' had status 1
Execution halted
I am now more confused..?
Hi Roberta. The first error suggests that there is only 1 .ped file in the folder you are working on. The second one mentions that it can't find that specific .tped, so perhaps because of the same issue as in the first error you shared? If you want me to have a look again, feel free to share files with me. If can also email me a link instead of dropping it here, if you'd prefer. Email is dani.mag.fernandes {at} gmail.com
Also, did you end up updating the software? Even with just one .ped in the folder, you should be able to see a list of samples (.ped files) detected.
Hi again! I updated the "plink2tkrelated.R" script to include many more sanity checks, and also added a verbose mode (--verbose, -v) that will print out some information as the dyad processing is happening. Both should hopefully help understand where and why errors are happening.
I strongly suggest you download this new version and give it a try with verbose mode in plink2tkrelated. Hope it helps!
Hi,
I'm having this recurring issue when running bam2plink. Sometimes the script will run for some pairs and then have this error at a certain pair and sometimes it will fail with this error immediately. I have checked that all SNPs are bi-allelic and there are no duplicates, as suggested in the previous issue. I cannot work out what may be causing this.