jolespin closed this 10 months ago
Is snipgenie trying to parse the sample name?
I tried this again using the original reads input (not manifest) and I got the same error.
The manifest file is used, but later it seems to mess up the VCF header during the reheader step. That step is there so the sample names match (by default they are the filenames). A file called samples.txt is used for this. What does it look like?
(base) [jespinoz@login01 DENV4]$ cat samples.txt
DENV3
Is it trying to parse the sample ids?
Yes, when doing the reheader it was still trying to parse the sample names. It doesn't do that now. You could update and try again.
Can you confirm your tool can work with this dataset?
Here's the log:
The following options were supplied
time: 14/12/2023 10:44:02
-------
input : ['reads/DENV2']
manifest : None
labelsep : _
labelindex : 0
reference : References/DENV2.fa
species : None
gb_file : None
threads : 1
overwrite : False
trim : False
unmapped : False
quality : 25
filters : QUAL>=40 && FORMAT/DP>=30 && DP4>=4
mask : None
custom_filters : False
platform : illumina
aligner : bwa
buildtree : False
bootstraps : 100
outdir : snipgenie_output/reads_based/DENV2
qc : False
dummy : False
test : False
version : False
omit_samples : []
get_stats : True
logfile : snipgenie_output/reads_based/DENV2/run.log
there seem to be duplicates:
sample
DENV2 198
Name: count, dtype: int64
error in filename parsing, check labelsep and labelindex options
name ... pair
41 DENV2_100_S91_1 ... 1
40 DENV2_100_S91_2 ... 2
3 DENV2_101_S103_1 ... 3
2 DENV2_101_S103_2 ... 4
157 DENV2_102_S115_1 ... 5
156 DENV2_102_S115_2 ... 6
140 DENV2_103_S127_1 ... 7
141 DENV2_103_S127_2 ... 8
158 DENV2_104_S139_1 ... 9
159 DENV2_104_S139_2 ... 10
[10 rows x 4 columns]
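The duplicate-sample error in the log above can be reproduced with a minimal sketch. It assumes the label is taken by splitting the file name on `labelsep` and keeping field `labelindex`, which is what the option names and the error message suggest; `parse_label` is a hypothetical helper for illustration, not snipgenie's actual code.

```python
from collections import Counter


def parse_label(name: str, labelsep: str = "_", labelindex: int = 0) -> str:
    """Hypothetical helper: keep one separator-delimited field of the name."""
    return name.split(labelsep)[labelindex]


# File names as they appear in the 'name' column of the log.
names = [
    "DENV2_100_S91_1",
    "DENV2_100_S91_2",
    "DENV2_101_S103_1",
    "DENV2_101_S103_2",
]

# With labelsep="_" and labelindex=0, every file gets the same label.
labels = [parse_label(n) for n in names]
counts = Counter(labels)

# A paired-end sample should own at most two files, so more than two is a clash.
duplicates = {label: n for label, n in counts.items() if n > 2}
print(duplicates)
```

Under that assumption every read file collapses to the single label DENV2, which matches the "DENV2 198" count reported in the log.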
I've tried several different versions. If Snipgenie can't accommodate this data, do you recommend another tool I can try besides snippy?
Here are the files I'm providing:
share_snipgenie
├── DENV2.fa
├── reads
│   └── veba_output
│       └── preprocess
│           ├── DENV2_100_S91
│           │   └── output
│           │       ├── cleaned_1.fastq.gz
│           │       └── cleaned_2.fastq.gz
│           ├── DENV2_101_S103
│           │   └── output
│           │       ├── cleaned_1.fastq.gz
│           │       └── cleaned_2.fastq.gz
│           └── DENV2_102_S115
│               └── output
│                   ├── cleaned_1.fastq.gz
│                   └── cleaned_2.fastq.gz
└── reads_table.DENV2.csv
10 directories, 8 files
https://drive.google.com/drive/folders/1wUFYi1UxY79jaok-a0dEbDiTPeGT46xb?usp=sharing
But you're not using the manifest file here? It is required for these cases. Otherwise it's trying to parse the file names directly. The fix I made was to avoid parsing at all when there is a manifest.
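For a layout like the one shared above, where every sample directory holds identically named cleaned_1.fastq.gz and cleaned_2.fastq.gz files, the sample name can only come from the directory, so a table mapping samples to read files is needed. Below is a minimal sketch of building such a table. The function name `build_read_table` and the column names `sample`, `filename1` and `filename2` are assumptions made for illustration; check the snipgenie documentation for the manifest format it actually expects.

```python
import csv
from pathlib import Path


def build_read_table(preprocess_dir, out_csv):
    """Write one CSV row per sample directory that contains both read files.

    The column names are placeholders, not a confirmed snipgenie format.
    """
    rows = []
    for sample_dir in sorted(Path(preprocess_dir).iterdir()):
        r1 = sample_dir / "output" / "cleaned_1.fastq.gz"
        r2 = sample_dir / "output" / "cleaned_2.fastq.gz"
        # Skip anything that is not a complete paired-end sample.
        if r1.is_file() and r2.is_file():
            rows.append(
                {"sample": sample_dir.name, "filename1": str(r1), "filename2": str(r2)}
            )
    with open(out_csv, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["sample", "filename1", "filename2"])
        writer.writeheader()
        writer.writerows(rows)
    return rows
```

Calling `build_read_table("reads/veba_output/preprocess", "reads_table.csv")` from inside share_snipgenie would give one row per sample, with the directory name (for example DENV2_100_S91) as the sample name, so no file name parsing is needed.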
Wow. Sorry about that! I forgot to change the -i to -M in the command. Running it now.
It seems to work for me with your data if I use the manifest file. The only issue is that the samples seem identical, so it can't find any informative SNPs.
What program are you using for visualizing the VCFs? I'm testing it out on the full set now.
OK, the newest version you have worked great. Thanks for all your help with this. Greatly appreciated. I will certainly publish using this tool in the future now that I have it all dialed in for my workflow.
Thanks. I used IGV to look at the VCF and BAM files.
I'm using the Snipgenie version installed from the main GitHub repository on Dec 1, 2023.
Here's my log: