GWW / scsnv

scSNV Mapping tool for 10X Single Cell Data
MIT License
22 stars 4 forks source link

error in scsnvmisc cells #29

Closed monoplasty closed 1 year ago

monoplasty commented 1 year ago

Hi,

Thank you for your amazing tool!

I encountered the error when I did scsnvmisc cells. The code and error is bellow.

(scsnv) [root@scRNA scsnv]# scsnvmisc cells -o sample_data/ sample_data/map/mapsummary.h5
Error {key} not in the barcode data columns

(scsnv) [root@scRNA scsnv]# scsnvmisc cells --skip-mt -o sample_data/ sample_data/map/mapsummary.h5
Skipping MT DNA filtering
Total Passed Cells 164
(19566, 164) (19566, 164)
/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/flammkuchen-1.0.2-py3.8.egg/flammkuchen/hdf5io.py:302: PerformanceWarning: 
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->Index(['gene_id', 'gene_name'], dtype='object')]

  store.put(group._v_pathname + "/" + name, level)

(scsnv) [root@scRNA scsnv]# h5ls sample_data/map/mapsummary.h5
barcode_ids              Dataset {182032}
barcode_rates            Group
barcodes                 Dataset {182032}
exonic                   Group
gene_ids                 Dataset {21188}
gene_names               Dataset {21188}
intronic                 Group
(scsnv) [root@scRNA scsnv]# h5ls sample_data/map/mapsummary.h5/barcode_rates
align_ambiguous          Dataset {182032}
align_antisense          Dataset {182032}
align_barcode_corrected  Dataset {182032}
align_cdna               Dataset {182032}
align_intergenic         Dataset {182032}
align_intronic           Dataset {182032}
align_multimapped        Dataset {182032}
align_tag_qa_fail        Dataset {182032}
align_total_reads        Dataset {182032}
align_umi_qa_fail        Dataset {182032}
align_unmapped           Dataset {182032}
discarded_reads          Dataset {182032}
field_order              Dataset {19}
genes                    Dataset {182032}
intron_molecules         Dataset {182032}
intron_pcr_dups          Dataset {182032}
intron_reads             Dataset {182032}
molecules                Dataset {182032}
pcr_dups                 Dataset {182032}
reads                    Dataset {182032}

Can I know where I made a mistake through the information I provided? Looking forward to your reply.

GWW commented 1 year ago

Hi,

It looks like your second run with the --skip-mt option worked. It just issued a warning about strings in the hdf5 file.

monoplasty commented 1 year ago

Thank you Gavin.

I'm trying to run the annotation currently. For the repeatmasker and REDIPortal files, I downloaded the repeatmasker file from https://hgdownload.cse.ucsc.edu/goldenpath/hg38/bigZips/ and get hg38.fa.out.gz file.

The complete execution code is as follows:

(scsnv) [root@scRNA ref_gene]# zless hg38.fa.out.gz | head
   SW  perc perc perc  query      position in query           matching       repeat              position in  repeat
score  div. del. ins.  sequence    begin     end    (left)    repeat         class/family         begin  end (left)   ID

  463   1.3  0.6  1.7  chr1        10001   10468 (248945954) +  (TAACCC)n      Simple_repeat            1  471    (0)      1
 3612  11.4 21.5  1.3  chr1        10469   11447 (248944975) C  TAR1           Satellite/telo       (399) 1712    483      2
  484  25.1 13.2  0.0  chr1        11505   11675 (248944747) C  L1MC5a         LINE/L1             (2382)  395    199      3
  239  29.4  1.9  1.0  chr1        11678   11780 (248944642) C  MER5B          DNA/hAT-Charlie       (74)  104      1      4
  318  23.0  3.7  0.0  chr1        15265   15355 (248941067) C  MIR3           SINE/MIR             (119)  143     49      5
   18  23.2  0.0  1.9  chr1        15798   15849 (248940573) +  (TGCTCC)n      Simple_repeat            1   52    (0)      6
   18  13.7  0.0  0.0  chr1        16713   16744 (248939678) +  (TGG)n         Simple_repeat            1   32    (0)      7

(scsnv) [root@scRNA ref_gene]# zless REDIportal_hg38.txt.gz | head
Region  Position        Ref     Ed      Strand  db      type    dbsnp   repeat  Func.wgEncodeGencodeBasicV34    Gene.wgEncodeGencodeBasicV34    GeneDetail.wgEncodeGencodeBasicV34 ExonicFunc.wgEncodeGencodeBasicV34      AAChange.wgEncodeGencodeBasicV34        Func.refGene    Gene.refGene    GeneDetail.refGene      ExonicFunc.refGene AAChange.refGene        Func.knownGene  Gene.knownGene  GeneDetail.knownGene    ExonicFunc.knownGene    AAChange.knownGene      phastConsElements100way
chr1    87158   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -
chr1    87168   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -
chr1    87171   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -
chr1    87189   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -
chr1    87218   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -
chr1    87225   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -
chr1    87231   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -
chr1    87242   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -
chr1    87248   T       C       -       A       ALU     -       SINE/AluJo      intergenic      OR4F5;AL627309.1        -       -       intergenic      OR4F5;LOC729737    -       -       intergenic      OR4F5;AL627309.1        -       -       -

(scsnv) [root@scRNA ref_gene]# wget ftp://ftp.ensembl.org/pub/release-99/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz
(scsnv) [root@scRNA ref_gene]# wget ftp://ftp.ensembl.org/pub/release-99/variation/vcf/homo_sapiens/1000GENOMES-phase_3.vcf.gz.csi
(scsnv) [root@scRNA ref_gene]# wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz
(scsnv) [root@scRNA ref_gene]# wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz.tbi

(scsnv) [root@scRNA ref_gene]# bcftools view -m2 -M2 -v snps -i "MAF>=0.01" -o 1000GENOMES-filtered.vcf.gz -O z 1000GENOMES-phase_3.vcf.gz
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20184"
[W::vcf_parse_info] INFO 'AFR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'AMR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EAS' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EUR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'SAS' is not defined in the header, assuming Type=String

(scsnv) [root@scRNA ref_gene]# bcftools +fill-tags ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz -- -t AF | bcftools view -m2 -M2 -v snps -i "MAF>=0.01" -o ALL.chrMT-filtered.vcf.gz

(scsnv) [root@scRNA ref_gene]# zgrep --no-filename -E "^(#)" -v ALL.chrMT-filtered.vcf.gz | gzip > ALL.chrMT-filtered-no-header.vcf.gz
gzip: -v.gz: No such file or directory

(scsnv) [root@scRNA ref_gene]# zgrep --no-filename -E "^(#)\1+" -v 1000GENOMES-filtered.vcf.gz ALL.chrMT-filtered-no-header.vcf.gz | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
gzip: -v.gz: No such file or directory

(scsnv) [root@scRNA ref_gene]# ls -lh
total 2.1G
-rw-r--r-- 1 root root 242M Jun  7 16:10 1000GENOMES-filtered.vcf.gz
-rw-r--r-- 1 root root 1.6G Jun  7 15:54 1000GENOMES-phase_3.vcf.gz
-rw-r--r-- 1 root root 1.8M Jun  7 15:26 1000GENOMES-phase_3.vcf.gz.csi
-rw-r--r-- 1 root root 1.6K Jun  7 16:49 1000GENOMES.txt.gz
-rw-r--r-- 1 root root 6.2K Jun  7 16:23 ALL.chrMT-filtered-no-header.vcf.gz
-rw-r--r-- 1 root root  55K Jun  7 16:12 ALL.chrMT-filtered.vcf.gz
-rw-r--r-- 1 root root 202K Jun  7 15:26 ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz
-rw-r--r-- 1 root root  119 Jun  7 15:26 ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz.tbi
-rw-r--r-- 1 root root  11K Jun  7 15:20 GRCh38_UCSC2ensembl.txt
-rw-r--r-- 1 root root 173M Jun  7 15:01 hg38.fa.out.gz
drwxr-xr-x 2 root root 4.0K Jun  5 18:46 index
-rw-r--r-- 1 root root  87M Jun  7 10:37 REDIportal_hg38.txt.gz

(scsnv) [root@scRNA ref_gene]# scsnvmisc annotate -r ref_gene/hg38.fa.out.gz -d ref_gene/1000GENOMES.txt.gz -e ref_gene/REDIportal_hg38.txt.gz sample_data/pileup/pileup
Processing sample_data/pileup/pileup
Kept 11361 out of 11641
[ True  True  True ...  True  True  True] (11641,) (11641, 164) (11641, 164)
Parsing 1000 Genomes file
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/scsnv/bin/scsnvmisc", line 4, in <module>
    __import__('pkg_resources').run_script('scsnvpy==1.0', 'scsnvmisc')
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 720, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1559, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/EGG-INFO/scripts/scsnvmisc", line 37, in <module>
    cmds.COMMANDS[cmd](args)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/annotate.py", line 61, in annotate_cmd
    aux.annotate_g1000(ann._snvs, args.g1000)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/annaux.py", line 93, in annotate_g1000
    df = pd.read_csv(fg1000, sep="\t", dtype={'chrom':object}, names=['chrom', 'pos', 'ref', 'alt'], header=0)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 812, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 889, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 4 and found 1

When I run the following command, it works fine.

(scsnv) [root@scRNA scsnv]# scsnvmisc annotate -e ref_gene/REDIportal_hg38.txt.gz sample_data/pileup/pileup
Processing sample_data/pileup/pileup
Kept 11361 out of 11641
[ True  True  True ...  True  True  True] (11641,) (11641, 164) (11641, 164)
Annotating edits
REDIportal 202

(scsnv) [root@scRNA scsnv]# ls -lh sample_data/pileup
total 59M
-rw-r--r-- 1 root root 4.4M Jun  7 17:03 pileup_annotated.h5
-rw-r--r-- 1 root root  54M Jun  7 10:08 pileup_barcode_matrices.h5
-rw-r--r-- 1 root root 3.0K Jun  7 10:08 pileup_barcodes.txt.gz
-rw-r--r-- 1 root root  251 Jun  7 17:10 pileup_filtering.txt
-rw-r--r-- 1 root root 244K Jun  7 17:03 pileup_passed_snvs.txt.gz
-rw-r--r-- 1 root root 247K Jun  7 10:08 pileup.txt.gz

Can the result obtained in this way be used for the scsnv snvcounts command?

Hope to get your reply.

I'm just starting to learn about transcriptomes. So there will be a lot of questions, thank you very much for your patient reply, thank you again!

GWW commented 1 year ago
(scsnv) [root@scRNA ref_gene]# zgrep --no-filename -E "^(#)" -v ALL.chrMT-filtered.vcf.gz | gzip > ALL.chrMT-filtered-no-header.vcf.gz
gzip: -v.gz: No such file or directory

It seems like there's something wrong with your environment as each zgrep commands is giving a strange error. As an alternative you can try:

bcftools concat ALL.chrMT-filtered.vcf.gz 1000GENOMES-filtered.vcf.gz | bcftools view --no-header | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz

Let me know if it works and I will replace the documentation with this.

GWW commented 1 year ago

On second thought you can just skip downloaded the MT 1000 genomes file and just use:

bcftools view --no-header -m2 -M2 -v snps -i "MAF>=0.01" 1000GENOMES-phase_3.vcf.gz | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
monoplasty commented 1 year ago

First I run this command:

(scsnv) [root@scRNA ref_gene]# bcftools concat ALL.chrMT-filtered.vcf.gz 1000GENOMES-filtered.vcf.gz | bcftools view --no-header | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
Checking the headers and starting positions of 2 files
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20184"
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20184"
Different number of samples in 1000GENOMES-filtered.vcf.gz. Perhaps "bcftools merge" is what you are looking for?
Failed to read from standard input: unknown file type

Then I run the following command:

(scsnv) [root@scRNA ref_gene]# bcftools view --no-header -m2 -M2 -v snps -i "MAF>=0.01" 1000GENOMES-phase_3.vcf.gz | cut -f 1,2,4,5 | gzip > 1000GENOMES.txt.gz
[W::bcf_hrec_check] Invalid tag name: "HGMD-PUBLIC_20184"
[W::vcf_parse_info] INFO 'AFR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'AMR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EAS' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'EUR' is not defined in the header, assuming Type=String
[W::vcf_parse_info] INFO 'SAS' is not defined in the header, assuming Type=String

(scsnv) [root@scRNA ref_gene]# ls -lh
total 2.1G
-rw-r--r-- 1 root root 242M Jun  7 16:10 1000GENOMES-filtered.vcf.gz
-rw-r--r-- 1 root root 1.6G Jun  7 15:54 1000GENOMES-phase_3.vcf.gz
-rw-r--r-- 1 root root 1.8M Jun  7 15:26 1000GENOMES-phase_3.vcf.gz.csi
-rw-r--r-- 1 root root  39M Jun  8 09:11 1000GENOMES.txt.gz
-rw-r--r-- 1 root root 6.2K Jun  7 16:23 ALL.chrMT-filtered-no-header.vcf.gz
-rw-r--r-- 1 root root  55K Jun  7 16:12 ALL.chrMT-filtered.vcf.gz
-rw-r--r-- 1 root root 202K Jun  7 15:26 ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz
-rw-r--r-- 1 root root  119 Jun  7 15:26 ALL.chrMT.phase3_callmom-v0_4.20130502.genotypes.vcf.gz.tbi
-rw-r--r-- 1 root root  11K Jun  7 15:20 GRCh38_UCSC2ensembl.txt
-rw-r--r-- 1 root root 173M Jun  7 15:01 hg38.fa.out.gz
drwxr-xr-x 2 root root 4.0K Jun  5 18:46 index
-rw-r--r-- 1 root root  87M Jun  7 10:37 REDIportal_hg38.txt.gz

(scsnv) [root@scRNA ref_gene]# zcat 1000GENOMES.txt.gz | head
1       11012   C       G
1       13110   G       A
1       13116   T       G
1       13118   A       G
1       13273   G       C
1       14464   A       T
1       14930   A       G
1       14933   G       A
1       15774   G       A
1       15777   A       G

(scsnv) [root@scRNA scsnv]# scsnvmisc annotate -r ref_gene/hg38.fa.out.gz -d ref_gene/1000GENOMES.txt.gz -e ref_gene/REDIportal_hg38.txt.gz sample_data/pileup/pileup
Processing sample_data/pileup/pileup
Kept 11361 out of 11641
[ True  True  True ...  True  True  True] (11641,) (11641, 164) (11641, 164)
Parsing 1000 Genomes file
Annotating the SNVs
Found in 1000 Genomes: 0
Annotating repeats
Traceback (most recent call last):
  File "/usr/local/miniconda3/envs/scsnv/bin/scsnvmisc", line 4, in <module>
    __import__('pkg_resources').run_script('scsnvpy==1.0', 'scsnvmisc')
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 720, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pkg_resources/__init__.py", line 1559, in run_script
    exec(code, namespace, namespace)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/EGG-INFO/scripts/scsnvmisc", line 37, in <module>
    cmds.COMMANDS[cmd](args)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/annotate.py", line 62, in annotate_cmd
    aux.annotate_repeats(ann._snvs, args.repeats)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/scsnvpy-1.0-py3.8-linux-x86_64.egg/scsnvpy/annaux.py", line 151, in annotate_repeats
    df = pd.read_csv(frep, sep="\t", skiprows=1, header=None, usecols=[5,6,7,9,10,11,12], 
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 912, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 583, in _read
    return parser.read(nrows)
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/readers.py", line 1704, in read
    ) = self._engine.read(  # type: ignore[attr-defined]
  File "/usr/local/miniconda3/envs/scsnv/lib/python3.8/site-packages/pandas-2.0.2-py3.8-linux-x86_64.egg/pandas/io/parsers/c_parser_wrapper.py", line 234, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 812, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 889, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 951, in pandas._libs.parsers.TextReader._convert_column_data
pandas.errors.ParserError: Too many columns specified: expected 7 and found 1

Sorry to keep asking you such stupid questions, and thank you so much for your patience! sincere thanks!

GWW commented 1 year ago

Hi,

The repeatmasker table you downloaded is the wrong table. You want to download the whole table from here.

The other thing to check is to make sure the chromosomes in your 1000 genomes text file match the chromosome style you used for cell ranger. Ie. chr1 or 1 etc.

monoplasty commented 1 year ago

Thanks a lot for your thoughtful reply! all the best