Closed MarcusLCC closed 3 years ago
Hi Marcus,
I think the 5'-v1 libraries you have are similar to 5'-v2 libraries (the difference is that the v2 are dual indexed instead of single indexed). I suspect the 5P_V2 option you specified is correct as the tool would have had some issues matching barcodes otherwise.
I should have clarified the documentation a bit but in order to run the snvcount method you need "annotate" the raw pileup file emitted by the pileup command. The annotate command applies some light filtering and basic SNV calling as the pileup command only provides raw base counts.
There is documentation on getting the annotation files and running the snvcount in the README. The additional files such as repeatmasker, REDIPortal and 1000 genomes are optional but can be helpful. I'll update the instructions to mention the annotate command requirement.
Thank you Gavin.
I'm trying to run the annotation currently. For the repeatmasker and REDIPortal files, I'm not sure whether I downloaded the correct one. Could you please provide some information about where I should get the file?
For example, I downloaded the repeatmasker file from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/
. Am I supposed to download the hg38.fa.out.gz
which is 'RepeatMasker .out file' for use in the annotation step? And for the REDIportal file, I went to the REDIportal website download page http://srv00.recas.ba.infn.it/atlas/download.html
and select hg38. There's a file named TABLE1_hg38.txt.gz
. Is this file supposed to be used as the REDIportal file in the annotation step?
Many thanks for your help.
Regards, Marcus
Hi Marcus,
Apologies for the delay. Those should be the files if you used Ensembl for your reference the chromosomes are likely in the format 1,2,3 etc. The scsnvmisc annotate tool expects this and automatically strips the chr from the repeat masker and RediPortal file. If this gives you any issues I will adjust the tool to not do that and instead require the chromosomes be correctly named in the files.
My files have the following format for reference:
Repeat Masker:
#bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id
0 1892 83 59 14 chr1 67108753 67109046 -181847376 + L1P5 LINE L1 5301 5607 -544 1
1 2582 27 0 23 chr1 8388315 8388618 -240567804 - AluY SINE Alu -15 296 1 1
1 4085 171 77 36 chr1 25165803 25166380 -223790042 + L1MB5 LINE L1 5567 6174 0 4
1 2285 91 0 13 chr1 33554185 33554483 -215401939 - AluSc SINE Alu -6 303 10 6
1 2451 64 3 26 chr1 41942894 41943205 -207013217 - AluY SINE Alu -7 304 1 8
1 1587 272 100 49 chr1 50331336 50332274 -198624148 + HAL1 LINE L1 773 1763 -744 9
1 1393 280 82 51 chr1 58719764 58720546 -190235876 + L2a LINE L2 2582 3418 -8 1
2 5372 165 14 27 chr1 75496057 75497775 -173458647 + L1MA9 LINE L1 5168 6868 -30 1
2 536 349 146 56 chr1 92274205 92275925 -156680497 + L2 LINE L2 406 2306 -1113 1
REDIPortal
chr1 10186 10187 A_G_+
chr1 10192 10193 A_G_+
chr1 10210 10211 A_G_+
chr1 10216 10217 A_G_+
chr1 10222 10223 A_G_+
chr1 10228 10229 A_G_+
chr1 10235 10236 A_G_+
chr1 10241 10242 A_G_+
chr1 10248 10249 A_G_+
chr1 10254 10255 A_G_+
Hi Gavin,
Thank you for your help. I'm still a bit confused where to get the files which are in the same format as yours, as I dived into the REDIportal and UCSC database but still didn't find the corresponding files. Sorry I'm not very experienced in using these files.
Could you please provide the link for downloading the RepeatMasker and REDIportal files? Thanks.
For your information, for bwa index generation and gtf file required in other steps, I downloaded the files from Gencode (GRCh38.p13 genome reference and release 36 gtf annotation).
Regards, Marcus
Hi Marcus,
I have updated the documentation to better reflect the two database files as well as the potential for chromosome name mismatches.
I removed some code that was force converting chromosome names as that would cause issues with gencode annotations (I believe they use the chr prefix). In case your annotations do not have the chromosome names would need to be mapped to your annotation file. You can get mapping files from here if they are needed. you would need to map the first column from the REDIPortal file (the chromosome name) using these files or the 5th column from the UCSC table browser file (again the chromosome name).
I downloaded the RepeatMasker table from the UCSC table browser here.
It looks like the REDIPortal annotations may have changed at some point. I remember they didn't previously provide hg38 locations and I had to lift it over. I have pushed an update to support the downloaded database file directly.
You'll need to pull the changes from the repository and reinstall the scsnvpy module
cd scsnvpy
python setup.py install
For reference these are the contents of the two files:
Repeat masker from the UCSC table browser rmsk table
#bin swScore milliDiv milliDel milliIns genoName genoStart genoEnd genoLeft strand repName repClass repFamily repStart repEnd repLeft id
0 1892 83 59 14 chr1 67108753 67109046 -181847376 + L1P5 LINE L1 5301 5607 -544 1
1 2582 27 0 23 chr1 8388315 8388618 -240567804 - AluY SINE Alu -15 296 1 1
1 4085 171 77 36 chr1 25165803 25166380 -223790042 + L1MB5 LINE L1 5567 6174 0 4
1 2285 91 0 13 chr1 33554185 33554483 -215401939 - AluSc SINE Alu -6 303 10 6
1 2451 64 3 26 chr1 41942894 41943205 -207013217 - AluY SINE Alu -7 304 1 8
1 1587 272 100 49 chr1 50331336 50332274 -198624148 + HAL1 LINE L1 773 1763 -744 9
1 1393 280 82 51 chr1 58719764 58720546 -190235876 + L2a LINE L2 2582 3418 -8 1
2 5372 165 14 27 chr1 75496057 75497775 -173458647 + L1MA9 LINE L1 5168 6868 -30 1
2 536 349 146 56 chr1 92274205 92275925 -156680497 + L2 LINE L2 406 2306 -1113 1
The REDIPortal database:
Region Position Ref Ed Strand db type dbsnp repeat Func.wgEncodeGencodeBasicV34 Gene.wgEncodeGencodeBasicV34 GeneDetail.wgEncodeGencodeBasicV34 ExonicFunc.wgEncodeGencodeBasicV34 AAChange.wgEncodeGencodeBasicV34 Func.refGene Gene.refGene GeneDetail.refGene ExonicFunc.refGene AAChange.refGene Func.knownGene Gene.knownGene GeneDetail.knownGene ExonicFunc.knownGene AAChange.knownGene phastConsElements100way
chr1 87158 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87168 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87171 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87189 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87218 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87225 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87231 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87242 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
chr1 87248 T C - A ALU - SINE/AluJo intergenic OR4F5;AL627309.1 - - intergenic OR4F5;LOC729737 - - intergenic OR4F5;AL627309.1 - - -
Apologies again for all the confusion hopefully this sorts out the annotation issues and everything works.
Thank you Gavin, this is so helpful.
I have re-installed and tried again, it runs perfectly well. I will try to interpret and utilise the result.
Cheers, Marcus
No problem. I am happy to have helped. One quick note the pileup command has a default allele fraction cutoff of 0.01 (this will find a lot of false positives and junk) and you can increase it using the --min-af
flag; for example, --min-af 0.25
. In the paper I found 0.25 to be good for single genotype sample types.
Hi Gavin,
I've been trying to run through the pipeline on the server these days using scRNAseq data. I'm rather new to single cell analysis, and there're some issues that I encountered and couldn't figure out.
And the fastq files for each sample actually include two files (R1&R2). Here I attach the first several lines of the fastq file for reference.
sample1_S29_L003_R1_001.fastq.gz
sample1_S29_L003_R2_001.fastq.gz
May I have your suggestion about the library type to choose?
And the files under working directory ${scSNV_workdir} include:
The files under the output directory ${scSNV_workdir}/out include:
, which does not contain the 'pileup_passed_snvs.txt.gz' which is required for the
scsnv snvcounts
Could you please tell me what I've done wrong here? Many thanks,
Regards, Marcus