GWW / scsnv

scSNV Mapping tool for 10X Single Cell Data
MIT License

Library type selection ( and no 'pileup_passed_snvs.txt.gz' generated) #7

Closed MarcusLCC closed 3 years ago

MarcusLCC commented 3 years ago

Hi Gavin,

I've been trying to run through the pipeline on the server these days using scRNAseq data. I'm rather new to single cell analysis, and there're some issues that I encountered and couldn't figure out.

  1. About the library type parameter in scSNV: 5'/3' and V2/V3 are available. However, to my knowledge my current data (our scRNAseq data generated 1~2 years ago) is based on 5' v1 chemistry. Besides, I cannot find much information about 5' v3 chemistry on Google, and the current 10X 5' documentation on the 10X website appears to be for version 2 (https://support.10xgenomics.com/single-cell-vdj/index/doc/user-guide-chromium-single-cell-5-reagent-kits-user-guide-v2-chemistry-dual-index). I wonder whether I have misunderstood the library types used here?

And the fastq files for each sample actually include two files (R1&R2). Here I attach the first several lines of the fastq file for reference.

sample1_S29_L003_R1_001.fastq.gz

@A00549:68:HWYJVDSXX:3:1101:1470:1031 1:N:0:CGTGCAGA
GCTTGAAAGACACTAATGCTGAAACATTTCTTATATGGGGGGATCGGGTCGGCGCCATTTTGGGACTGAGACTGGTTGTGGGGGAGGGAAAAGCGGCAAAAGGGGATTATTCAAAGTACCGAAAACCTTCTCCCGGGATCAGGCGCGGCGG
+
FFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF
@A00549:68:HWYJVDSXX:3:1101:2157:1031 1:N:0:CGTGCAGA
CCGTACTAGATGTGTACCTCCTTGTATTTCTTATATGGGCTGCCGACCTCACGGGCTATTTAAAGGTACGCGCCGCGGCCAAGGCCGCACCGTACTGGGCGGGGGTCTGGGGAGTGCAGCAGCCATGGCAAGCCGTCTCCTGCTCAACAAC
+
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

sample1_S29_L003_R2_001.fastq.gz

@A00549:68:HWYJVDSXX:3:1101:1470:1031 2:N:0:CGTGCAGA
NTCAATTCACTTCACAGACGATTCTTGCCAATTTTAATAAACTTCTGGGGCAAAATTATCCAAAAACACTGTAAATCCAAAATGGCCACTTAAAATATCCAGGGCCTTTTACACAAAACCTAGATGATGATCTTCATATCTGAGTAATTCA
+
#FFFFF:FFFFFFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF
@A00549:68:HWYJVDSXX:3:1101:2157:1031 2:N:0:CGTGCAGA
NGTACGTGCACCACAGCTTGCTGACGATGAAGAGCTCCTCACGCTTCACCACCTGCTCCCTGAGCTTCTCCTGAATGGCCACCCCCACCTCATTCTCATTCTGGTACACATGGGCACAGTCGATGTGGCGGTACCCGACGTCAATGGCCAC
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFF

May I have your suggestion about the library type to choose?

  2. Without knowing for sure which 5' version to choose, I ran the pipeline with 'V2_5P' to see what I would get. The 'count', 'map', 'collapse' and 'pileup' steps seemed to run without error, and the 'scsnvmisc cells' step generated some warnings. When it came to the 'scsnv snvcounts' step, the 'pileup_passed_snvs.txt.gz' file was missing. I attach part of the run log here.
(scSNVref) -bash-4.2$ scsnv map -l V2_5P \
>           -i ${scSNV_index_prefix} \
>           -g ${genome} \
>           -b ${scSNV_workdir}/barcode_FYC_out \
>           -t 63 --bam-write 8 -q 63 \
>           -o ${scSNV_workdir}/out/ \
>           FYC_D0/
[00:00:00] Loaded 340198 known barcodes from 
/home/marcus/scSNV/data_sample/barcode_FYC_out_counts.txt.gz
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[M::bwa_idx_load_from_disk] read 0 ALT contigs
[00:00:03] Preparing bam header for 194 references
[00:00:04] Processing FYC_D0//FYC-BM-Day0-5Seq-1_S29_L003_R1_001.fastq.gz and FYC_D0//FYC-BM-Day0-5Seq-1_S29_L003_R2_001.fastq.gz total reads = 130281919
[00:00:04] Estimated memory needed for alignment tags: 10.22 GB
[00:00:41] Processed 2500000 / 457164529 [67567 / sec], ETA = 1h 52m  CDNA: 62.23 PRE: 6.66 INT: 1.03 AMB: 0.39 ANT: 3.06 MUL: 13.39 UNM: 13.21 QA: 0.03
[00:01:16] Processed 5000000 / 457164529 [69444 / sec], ETA = 1h 48m  CDNA: 62.27 PRE: 6.67 INT: 1.03 AMB: 0.39 ANT: 3.04 MUL: 13.40 UNM: 13.17 QA: 0.03
...
[01:52:35] Processed 452500000 / 457164529 [67027 / sec], ETA = 0h 1m  CDNA: 62.24 PRE: 6.71 INT: 1.04 AMB: 0.39 ANT: 3.04 MUL: 13.46 UNM: 13.08 QA: 0.03
[01:53:12] Processed 455000000 / 457164529 [67030 / sec], ETA = 0h 0m  CDNA: 62.24 PRE: 6.71 INT: 1.04 AMB: 0.39 ANT: 3.04 MUL: 13.46 UNM: 13.08 QA: 0.03
[01:54:27] Merged alignments size = 315207950 Merged barcode rate size = 340198 sum = 423802572
[01:54:27] Wrote 457164529 sorted alignments across 92 bam files
[01:54:27] Sorting the alignment tags
Alignment summary total reads = 457164529
Barcode QA Fail                          40923430         8.95%
UMI QA Fail                                 54686         0.01%
TAG QA Fail                                152508         0.03%
Unmapped                                 18835470         4.12%
Intergenic                                4751913         1.04%
cDNA                                    284546321        62.24%
Intronic                                 30661629         6.71%
Multimapped                              61533261        13.46%
Ambiguous                                 1795235         0.39%
Antisense                                13910076         3.04%

Barcodes Correct                        408679626        89.39%
Barcodes Corrected                        7561473         1.65%

[01:55:14] Beginning quantification step
[01:55:14] Quantifying UMIs
...
[01:55:20] Summarizing
R = 315207950 R2 = 315207950 R3 = 315207950 R4 = 315207950
[01:55:23] Total genes detected 31736 from 275209 barcodes
[01:55:23] Writing barcode rates
  cDNA UMI Duplicate Rate =     83.39%
  cDNA PCR Duplicate Rate =     23.95%
  Intronic UMI Duplicate Rate = 81.81%
  Intronic PCR Duplicate Rate = 22.63%
  Total UMI Duplicate Rate =    83.22%
  Total PCR Duplicate Rate =    23.81%
  Total Reads Used = 310678869.00
  Total Molecules  = 52129055.00
  Total PCR Dups   = 73961239.00
  Total Discarded  = 4529081.00
[01:55:26] Writing cDNA counts
[01:55:33] Writing intronic counts
[01:55:34] Done
[01:55:34] Quantification done

[01:55:35] Merging bam files and correcting UMIs
[01:55:35] Total UMI corrections 1466688
[01:55:35] Total bad UMI combinations 2035423
[01:55:35] Found 92 bam files to merge
[01:55:37] Merging, writing and correcting alignments
0x5624f1076500 194
[01:55:54] Merged 5000000 / 457164529 [263157 / sec], ETA = 0h 28m
[01:56:10] Merged 10000000 / 457164529 [285714 / sec], ETA = 0h 26m
...
[02:21:15] Merged 455000000 / 457164529 [295454 / sec], ETA = 0h 0m
[02:21:22] Done writing unmapped reads
[02:21:22] Done writing 457164529 reads and 4529081 marked as discarded
[02:21:22] Deleting the temporary bam files
[02:21:28] Done
(scSNVref) -bash-4.2$ scsnv collapse -l V2_5P \
>                -r ${genome} \
>                -i ${scSNV_index_prefix} \
>                -o ${scSNV_workdir}/out \
>                --threads 63 --bam-write 8 \
>                -b ${scSNV_workdir}/barcode_FYC_out_counts.txt.gz \
>                ${scSNV_workdir}/out/merged.bam
[00:00:00] Loading the genome
[00:00:06] Loading the transcriptome index
[00:00:06] Loaded 340198 known barcodes from 
/home/marcus/scSNV/data_sample/barcode_FYC_out_counts.txt.gz
[00:00:07] Writing collapsed alignments to 
/home/marcus/scSNV/data_sample/outcollapsed.bam
[00:00:11] Processed 1020242 reads from 196149 groups, Collapsed = 99.96%, Lost = 0.044107% due to 46 ambiguous groups, Collapsed Reads = 196103
[00:00:12] Processed 1756838 reads from 319851 groups, Collapsed = 99.97%, Lost = 0.033128% due to 61 ambiguous groups, Collapsed Reads = 319790
...
[00:15:13] Processed 281407816 reads from 47173616 groups, Collapsed = 99.93%, Lost = 0.065156% due to 20375 ambiguous groups, Collapsed Reads = 47153241
[00:18:43] Processed 310678869 reads from 52129055 groups, Collapsed = 99.94%, Lost = 0.059043% due to 20386 ambiguous groups, Collapsed Reads = 52108669
[00:18:43] Total Reads Processed = 310678869 dups = 57690200
scsnvmisc cells --skip-mt -o ${scSNV_workdir}/out ${scSNV_workdir}/out/summary.h5
Error.  nthreads cannot be larger than environment variable "NUMEXPR_MAX_THREADS" (64)/home/marcuslc/miniconda3/envs/scSNVref/lib/python3.9/site-packages/scsnvpy-1.0-py3.9-linux-x86_64.egg/scsnvpy/cells.py:
  ax.set_xscale('log', subsx=[2,3,4,5,6,7,8,9])
Skipping MT DNA filtering
/home/marcuslc/miniconda3/envs/scSNVref/lib/python3.9/site-packages/scsnvpy-1.0-py3.9-linux-x86_64.egg/scsnvpy/cells.py:92: MatplotlibDeprecationWarning: The 'subsx' parameter of __init__() has been renamed
  ax.set_xscale('log', subsx=[2,3,4,5,6,7,8,9])
/home/marcuslc/miniconda3/envs/scSNVref/lib/python3.9/site-packages/scsnvpy-1.0-py3.9-linux-x86_64.egg/scsnvpy/cells.py:273: MatplotlibDeprecationWarning: The 'subsy' parameter of __init__() has been rename
  ax.set_yscale('log', subsy=[2,3,4,5,6,7,8,9])
/home/marcuslc/miniconda3/envs/scSNVref/lib/python3.9/site-packages/scsnvpy-1.0-py3.9-linux-x86_64.egg/scsnvpy/cells.py:274: MatplotlibDeprecationWarning: The 'subsx' parameter of __init__() has been rename
  ax.set_xscale('log', subsx=[2,3,4,5,6,7,8,9])
Total Passed Cells 5588
(30805, 5588) (30805, 5588)
/home/marcuslc/miniconda3/envs/scSNVref/lib/python3.9/site-packages/scsnvpy-1.0-py3.9-linux-x86_64.egg/scsnvpy/cells.py:331: PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['gene_id', 'gene_name'], dtype='object')]

  fl.save(os.path.join(args.output, 'cells.h5'), out)
(scSNVref) -bash-4.2$ scsnv pileup -l V2_5P \
>              -i ${scSNV_index_prefix} \
>              -r ${genome} \
>              -o ${scSNV_workdir}/out/pileup \
>              -p ${scSNV_workdir}/out/passed_barcodes.txt.gz \
>              -t 63 -x 63 \
>              ${scSNV_workdir}/out/collapsed.bam
[00:00:00] Loading the genome
[00:00:05] Read 5588 passed barcodes
[00:00:05] Loading the transcriptome index
Splice site index built
Min af = 0.01
[00:00:25] Processed 1102668 reads [78762 / second], bases with min barcodes = 2242811 plus = 1235140 minus = 1106474, total passed bases = 4331 current ref = chr1: 40738896
[00:00:46] Processed 2353268 reads [67236.2 / second], bases with min barcodes = 5513003 plus = 2688521 minus = 3067982, total passed bases = 9802 current ref = chr1: 145359081
...
[00:13:42] Processed 47933629 reads [59104.4 / second], bases with min barcodes = 94793748 plus = 50880397 minus = 49332150, total passed bases = 168306 current ref = chrM: 15867
[00:13:42] Finished. Processed  47933629 reads, bases with min barcodes =  94793748 plus = 50880397 minus = 49332150, total passed bases = 168306
plus > 0: 176039884 / 335325209
minus > 0: 167269942 / 335325209
total_barcodes > 0: 335325209 / 335325209
plus_barcodes > 0: 317352411 / 335325209
minus_barcodes > 0: 315130218 / 335325209

And the files under working directory ${scSNV_workdir} include:

-rw-r--r-- 1 barcode_FYC_out_counts.txt.gz
-rw-r--r-- 1 barcode_FYC_out_totals.txt
-rw-r--r-- 1 barcodes_FYC_in.txt
-rwxr-x--- 1 barcodes.tsv.gz
drwxr-xr-x 2 FYC_D0
drwxr-x--- 3 out
drwx------ 2 out_btmp

The files under the output directory ${scSNV_workdir}/out include:

-rw-r--r-- 1 alignment_summary.txt
-rw-r--r-- 1 anndata.h5ad
drwx------ 2 _btmp
-rw-r--r-- 1 cells.h5
-rw-r--r-- 1 cells.png
-rw-r--r-- 1 collapsed.bam
-rw-r--r-- 1 merged.bam
-rw-r--r-- 1 passed_barcodes.txt.gz
-rw-r--r-- 1 pileup_barcode_matrices.h5
-rw-r--r-- 1 pileup_barcodes.txt.gz
-rw-r--r-- 1 pileup.txt.gz
-rw-r--r-- 1 summary.h5

This listing does not contain 'pileup_passed_snvs.txt.gz', which is required for scsnv snvcounts.

Could you please tell me what I've done wrong here? Many thanks,

Regards, Marcus

GWW commented 3 years ago

Hi Marcus,

I think the 5'-v1 libraries you have are similar to 5'-v2 libraries (the difference is that the v2 libraries are dual indexed instead of single indexed). I suspect the V2_5P option you specified is correct, as the tool would have had some issues matching barcodes otherwise.
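One way to sanity-check the library layout is to look at where the template switch oligo starts in R1. The sketch below is not part of scSNV. It assumes the usual 10x 5' layout of a 16 bp cell barcode, a 10 bp UMI and then the 13 bp switch oligo TTTCTTATATGGG; that layout is not stated anywhere in this thread, and the function names are made up for illustration.

```python
# Assumed 10x 5' R1 layout (an assumption, not taken from scSNV):
# 16 bp cell barcode, 10 bp UMI, then the 13 bp template switch oligo.
BARCODE_LEN = 16
UMI_LEN = 10
SWITCH_OLIGO = "TTTCTTATATGGG"


def split_r1(seq, barcode_len=BARCODE_LEN, umi_len=UMI_LEN):
    """Split an R1 read into (barcode, umi, remainder)."""
    barcode = seq[:barcode_len]
    umi = seq[barcode_len:barcode_len + umi_len]
    rest = seq[barcode_len + umi_len:]
    return barcode, umi, rest


def looks_like_5p_10bp_umi(seq):
    """True if the switch oligo starts right after barcode + 10 bp UMI."""
    _, _, rest = split_r1(seq)
    return rest.startswith(SWITCH_OLIGO)


# Prefixes of the two R1 reads quoted in the issue.
r1_reads = [
    "GCTTGAAAGACACTAATGCTGAAACATTTCTTATATGGGGGGATCGGGTCGG",
    "CCGTACTAGATGTGTACCTCCTTGTATTTCTTATATGGGCTGCCGACCTCAC",
]
for read in r1_reads:
    barcode, umi, _ = split_r1(read)
    print(barcode, umi, looks_like_5p_10bp_umi(read))
```

Both quoted reads have the switch oligo at position 27, which is consistent with a 10 bp UMI and therefore with the V2_5P setting rather than a 12 bp UMI chemistry.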

I should have clarified the documentation a bit, but in order to run the snvcount method you need to "annotate" the raw pileup file emitted by the pileup command. The annotate command applies some light filtering and basic SNV calling, as the pileup command only provides raw base counts.

There is documentation on getting the annotation files and running the snvcount in the README. The additional files such as repeatmasker, REDIPortal and 1000 genomes are optional but can be helpful. I'll update the instructions to mention the annotate command requirement.
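A small pre-flight check can catch the missing file before the counting step is launched. This is only a sketch: the two file names come from this thread, but the assumption that the annotate step writes 'pileup_passed_snvs.txt.gz' next to 'pileup.txt.gz' is inferred from the discussion, and `missing_outputs` is a hypothetical helper, not an scSNV function.

```python
import tempfile
from pathlib import Path

# File names taken from the thread; that annotate writes the second one next
# to the first is an assumption based on the discussion above.
RAW_PILEUP = "pileup.txt.gz"
PASSED_SNVS = "pileup_passed_snvs.txt.gz"


def missing_outputs(out_dir, required):
    """Return the names in `required` that are not regular files in out_dir."""
    out_dir = Path(out_dir)
    return [name for name in required if not (out_dir / name).is_file()]


# Demo on a throwaway directory that mimics the listing in the issue:
# the raw pileup exists, the annotated SNV list does not.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / RAW_PILEUP).touch()
    missing = missing_outputs(tmp, [RAW_PILEUP, PASSED_SNVS])
    if missing:
        print("Run the annotate step first; missing:", ", ".join(missing))
```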

MarcusLCC commented 3 years ago

Thank you Gavin.

I'm trying to run the annotation currently. For the repeatmasker and REDIPortal files, I'm not sure whether I downloaded the correct one. Could you please provide some information about where I should get the file?

For example, I downloaded the repeatmasker file from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/. Am I supposed to download hg38.fa.out.gz, which is the 'RepeatMasker .out file', for use in the annotation step? And for the REDIportal file, I went to the REDIportal website download page http://srv00.recas.ba.infn.it/atlas/download.html and selected hg38. There's a file named TABLE1_hg38.txt.gz. Is this file supposed to be used as the REDIportal file in the annotation step?

Many thanks for your help.

Regards, Marcus

GWW commented 3 years ago

Hi Marcus,

Apologies for the delay. Those should be the files. If you used Ensembl for your reference, the chromosomes are likely in the format 1, 2, 3, etc. The scsnvmisc annotate tool expects this and automatically strips the chr prefix from the RepeatMasker and REDIPortal files. If this gives you any issues, I will adjust the tool to not do that and instead require that the chromosomes be correctly named in the files.
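To make the naming issue concrete, here is a minimal sketch of checking for and stripping a chr prefix. It illustrates the idea only and is not the code used by scsnvmisc annotate; both function names are hypothetical.

```python
def uses_chr_prefix(names):
    """True if the list is non-empty and every name starts with 'chr'."""
    names = list(names)
    return bool(names) and all(n.startswith("chr") for n in names)


def strip_chr_prefix(name):
    """Drop a leading 'chr' if present; leave other names untouched."""
    return name[3:] if name.startswith("chr") else name


# Example names in the UCSC/GENCODE style seen in the logs above.
reference_names = ["chr1", "chr2", "chrX", "chrM"]
print(uses_chr_prefix(reference_names))
print([strip_chr_prefix(n) for n in reference_names])
```

Note that plain stripping turns chrM into M, while Ensembl references name the mitochondrial genome MT, so a simple prefix strip is not a complete UCSC-to-Ensembl conversion.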

My files have the following format for reference:

Repeat Masker:

#bin    swScore milliDiv        milliDel        milliIns        genoName        genoStart       genoEnd genoLeft        strand  repName repClass        repFamily       repStart        repEnd  repLeft   id
0       1892    83      59      14      chr1    67108753        67109046        -181847376      +       L1P5    LINE    L1      5301    5607    -544    1
1       2582    27      0       23      chr1    8388315 8388618 -240567804      -       AluY    SINE    Alu     -15     296     1       1
1       4085    171     77      36      chr1    25165803        25166380        -223790042      +       L1MB5   LINE    L1      5567    6174    0       4
1       2285    91      0       13      chr1    33554185        33554483        -215401939      -       AluSc   SINE    Alu     -6      303     10      6
1       2451    64      3       26      chr1    41942894        41943205        -207013217      -       AluY    SINE    Alu     -7      304     1       8
1       1587    272     100     49      chr1    50331336        50332274        -198624148      +       HAL1    LINE    L1      773     1763    -744    9
1       1393    280     82      51      chr1    58719764        58720546        -190235876      +       L2a     LINE    L2      2582    3418    -8      1
2       5372    165     14      27      chr1    75496057        75497775        -173458647      +       L1MA9   LINE    L1      5168    6868    -30     1
2       536     349     146     56      chr1    92274205        92275925        -156680497      +       L2      LINE    L2      406     2306    -1113   1

REDIPortal:

chr1    10186   10187   A_G_+
chr1    10192   10193   A_G_+
chr1    10210   10211   A_G_+
chr1    10216   10217   A_G_+
chr1    10222   10223   A_G_+
chr1    10228   10229   A_G_+
chr1    10235   10236   A_G_+
chr1    10241   10242   A_G_+
chr1    10248   10249   A_G_+
chr1    10254   10255   A_G_+

MarcusLCC commented 3 years ago

Hi Gavin,

Thank you for your help. I'm still a bit confused where to get the files which are in the same format as yours, as I dived into the REDIportal and UCSC database but still didn't find the corresponding files. Sorry I'm not very experienced in using these files.

Could you please provide the link for downloading the RepeatMasker and REDIportal files? Thanks.

For your information, for bwa index generation and gtf file required in other steps, I downloaded the files from Gencode (GRCh38.p13 genome reference and release 36 gtf annotation).

Regards, Marcus

GWW commented 3 years ago

Hi Marcus,

I have updated the documentation to better reflect the two database files as well as the potential for chromosome name mismatches.

I removed some code that was force-converting chromosome names, as that would cause issues with GENCODE annotations (I believe they use the chr prefix). If your annotations do not use the same chromosome names as these files, the names would need to be mapped to match your annotation file. You can get mapping files from here if they are needed. You would need to map the first column of the REDIPortal file (the chromosome name) using these files, or the genoName column of the UCSC table browser file (again the chromosome name; it is column 5 counting from zero, i.e. the sixth column).
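A rough sketch of that remapping for a tab-delimited file follows. The mapping dictionary here is a toy example, the real mapping would come from the mapping files mentioned above, and `remap_chromosomes` is a hypothetical helper rather than anything shipped with scSNV. Names without an entry in the mapping (including header fields such as Region) are left unchanged.

```python
def remap_chromosomes(lines, mapping, column):
    """Yield tab-delimited lines with the chromosome in `column` (0-based) remapped.

    Comment lines starting with '#' and lines that are too short pass through
    unchanged; names missing from `mapping` are kept as they are.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(fields) <= column:
            yield line
            continue
        fields[column] = mapping.get(fields[column], fields[column])
        yield "\t".join(fields) + "\n"


# Toy mapping; a real one would be loaded from a mapping file.
ucsc_to_ensembl = {"chr1": "1", "chrM": "MT"}

rmsk_lines = [
    "#bin\tswScore\tmilliDiv\tmilliDel\tmilliIns\tgenoName\tgenoStart\n",
    "0\t1892\t83\t59\t14\tchr1\t67108753\n",
]
rediportal_lines = [
    "Region\tPosition\tRef\tEd\tStrand\n",
    "chr1\t87158\tT\tC\t-\n",
]

# genoName is column 5 (0-based) in the rmsk table; Region is column 0.
remapped_rmsk = list(remap_chromosomes(rmsk_lines, ucsc_to_ensembl, column=5))
remapped_redi = list(remap_chromosomes(rediportal_lines, ucsc_to_ensembl, column=0))
print("".join(remapped_rmsk + remapped_redi), end="")
```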

I downloaded the RepeatMasker table from the UCSC table browser here.

It looks like the REDIPortal annotations may have changed at some point. I remember they didn't previously provide hg38 locations and I had to lift it over. I have pushed an update to support the downloaded database file directly.

You'll need to pull the changes from the repository and reinstall the scsnvpy module:

cd scsnvpy
python setup.py install

For reference these are the contents of the two files:

Repeat masker from the UCSC table browser rmsk table

#bin    swScore milliDiv        milliDel        milliIns        genoName        genoStart       genoEnd genoLeft        strand  repName repClass        repFamily       repStart    repEnd   repLeft id
0       1892    83      59      14      chr1       67108753        67109046        -181847376      +       L1P5    LINE    L1      5301    5607    -544    1
1       2582    27      0       23      chr1       8388315 8388618 -240567804      -       AluY    SINE    Alu     -15     296     1       1
1       4085    171     77      36      chr1       25165803        25166380        -223790042      +       L1MB5   LINE    L1      5567    6174    0       4
1       2285    91      0       13      chr1       33554185        33554483        -215401939      -       AluSc   SINE    Alu     -6      303     10      6
1       2451    64      3       26      chr1       41942894        41943205        -207013217      -       AluY    SINE    Alu     -7      304     1       8
1       1587    272     100     49      chr1       50331336        50332274        -198624148      +       HAL1    LINE    L1      773     1763    -744    9
1       1393    280     82      51      chr1       58719764        58720546        -190235876      +       L2a     LINE    L2      2582    3418    -8      1
2       5372    165     14      27      chr1       75496057        75497775        -173458647      +       L1MA9   LINE    L1      5168    6868    -30     1
2       536     349     146     56      chr1       92274205        92275925        -156680497      +       L2      LINE    L2      406     2306    -1113   1

The REDIPortal database:

Region  Position        Ref     Ed      Strand  db      type    dbsnp   repeat  Func.wgEncodeGencodeBasicV34    Gene.wgEncodeGencodeBasicV34    GeneDetail.wgEncodeGencodeBasicV34  ExonicFunc.wgEncodeGencodeBasicV34       AAChange.wgEncodeGencodeBasicV34        Func.refGene    Gene.refGene    GeneDetail.refGene      ExonicFunc.refGene      AAChange.refGene    Func.knownGene   Gene.knownGene  GeneDetail.knownGene    ExonicFunc.knownGene    AAChange.knownGene      phastConsElements100way
chr1    87158   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -
chr1    87168   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -
chr1    87171   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -
chr1    87189   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -
chr1    87218   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -
chr1    87225   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -
chr1    87231   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -
chr1    87242   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -
chr1    87248   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737 -       -   intergenic       OR4F5;AL627309.1        -       -       -

Apologies again for all the confusion. Hopefully this sorts out the annotation issues and everything works.

MarcusLCC commented 3 years ago

Thank you Gavin, this is so helpful.

I have re-installed and tried again, and it runs perfectly well. I will try to interpret and utilise the result.

Cheers, Marcus

GWW commented 3 years ago

No problem. I am happy to have helped. One quick note: the pileup command has a default allele fraction cutoff of 0.01 (this will find a lot of false positives and junk). You can increase it using the --min-af flag; for example, --min-af 0.25. In the paper I found 0.25 to be good for single genotype sample types.
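To illustrate what the cutoff does, the sketch below applies an allele fraction threshold to made-up counts. How scSNV computes the allele fraction internally is not described in this thread, so the definition used here (alternate count divided by total count) and the function names are assumptions for illustration only.

```python
def allele_fraction(alt_count, total_count):
    """Alternate-allele fraction; 0.0 when there is no coverage."""
    return alt_count / total_count if total_count else 0.0


def passes_min_af(alt_count, total_count, min_af):
    """True if the site meets the allele fraction cutoff."""
    return allele_fraction(alt_count, total_count) >= min_af


# Made-up (alt, total) counts, not taken from any real pileup.
sites = [(1, 100), (10, 100), (30, 100)]
for alt, total in sites:
    print(alt, total,
          passes_min_af(alt, total, min_af=0.01),   # default cutoff from the thread
          passes_min_af(alt, total, min_af=0.25))   # stricter cutoff suggested above
```

With these toy counts all three sites pass at 0.01, but only the last one passes at 0.25, which shows why the stricter cutoff removes much of the low-frequency noise.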