10XGenomics / cellranger

10x Genomics Single Cell Analysis
https://www.10xgenomics.com/support/software/cell-ranger

mkref error #125

Closed pavsol closed 3 years ago

pavsol commented 3 years ago

Hi,

I am having an issue preparing a reference with mkref. Here are the command and the error message:

$ ~/tools/cellranger-6.0.1/bin/cellranger mkref --genome=arabidopsis_genes_lncRNA --fasta=~/references/arabidopsis/Arabidopsis_thaliana.TAIR10.Chr.fa --genes=../Araport11_GTF_genes_lncRNA.Mar202021.filtered.gtf --nthreads=6
['/home/pavsol/tools/cellranger-6.0.1/bin/rna/mkref', '--genome=arabidopsis_genes_lncRNA', '--fasta=~/references/arabidopsis/Arabidopsis_thaliana.TAIR10.Chr.fa', '--genes=../Araport11_GTF_genes_lncRNA.Mar202021.filtered.gtf', '--nthreads=6']
Creating new reference folder at /home/pavsol/scRNAseq_pilot/arabidopsis/cellranger/arabidopsis_genes_lncRNA
...done

Writing genome FASTA file into reference folder...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

Traceback (most recent call last):
  File "/home/pavsol/tools/cellranger-6.0.1/lib/python/cellranger/reference.py", line 750, in validate_gtf
    subprocess.check_output(cmd, stderr=subprocess.STDOUT)
  File "/home/pavsol/tools/cellranger-6.0.1/external/anaconda/lib/python3.7/subprocess.py", line 411, in check_output
    **kwargs).stdout
  File "/home/pavsol/tools/cellranger-6.0.1/external/anaconda/lib/python3.7/subprocess.py", line 512, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['gtf_to_gene_index', '/home/pavsol/scRNAseq_pilot/arabidopsis/cellranger/arabidopsis_genes_lncRNA', '/home/pavsol/scRNAseq_pilot/arabidopsis/cellranger/arabidopsis_genes_lncRNA/tmp0hjg1w95.json']' returned non-zero exit status 1.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/pavsol/tools/cellranger-6.0.1/bin/rna/mkref", line 139, in <module>
    main()
  File "/home/pavsol/tools/cellranger-6.0.1/bin/rna/mkref", line 130, in main
    reference_builder.build_gex_reference()
  File "/home/pavsol/tools/cellranger-6.0.1/lib/python/cellranger/reference.py", line 613, in build_gex_reference
    self.validate_gtf()
  File "/home/pavsol/tools/cellranger-6.0.1/lib/python/cellranger/reference.py", line 753, in validate_gtf
    raise GexReferenceError("Error detected in GTF file: " + exc.output) from exc
TypeError: can only concatenate str (not "bytes") to str 

I am using an annotation for Arabidopsis thaliana downloaded from arabidopsis.org (Araport11_GTF_genes_transposons.Mar202021.gtf.gz), which I further restricted to keep only CDS, exon, 5' UTR, 3' UTR, gene, lncRNA and mRNA features:

$ awk '$3 == "CDS" || $3 == "exon" || $3 == "five_prime_UTR" || $3 == "gene" || $3 == "lnc_RNA" || $3 == "mRNA" || $3 == "three_prime_UTR" {print $0}' 

First few lines of my GTF:

Chr1    Araport11       gene    3631    5899    .       +       .       transcript_id "AT1G01010"; gene_id "AT1G01010";
Chr1    Araport11       mRNA    3631    5899    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       five_prime_UTR  3631    3759    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    3631    3913    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     3760    3913    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    3996    4276    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     3996    4276    .       +       2       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    4486    4605    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       CDS     4486    4605    .       +       0       transcript_id "AT1G01010.1"; gene_id "AT1G01010";
Chr1    Araport11       exon    4706    5095    .       +       .       transcript_id "AT1G01010.1"; gene_id "AT1G01010";

Cellranger version 6.0.1

I am obviously not the only one having this issue: https://stackoverflow.com/questions/67706086/cellranger-how-to-convert-a-gtf-file-to-string?newreg=1ed75ab3d056488eae21facdbc36035f

Any idea what is going wrong?

Thank you! Pavel

evolvedmicrobe commented 3 years ago

Hi Pavel,

It looks like the error message is not being printed because of a Python 2 vs. Python 3 issue, which we'll need to fix. To view the actual error message, would you mind running the program directly and posting its output here? The command to do so is:

cellranger-6.0.1/lib/bin/gtf_to_gene_index /home/pavsol/scRNAseq_pilot/arabidopsis/cellranger/arabidopsis_genes_lncRNA test.json

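For readers hitting the same traceback, the masking effect can be reproduced in a few lines. This is only a sketch, not Cell Ranger's code: `FAILING_CMD` is a hypothetical stand-in for gtf_to_gene_index (a child Python process that prints the error and exits with status 1), and `format_gtf_error` is an illustrative helper name. It shows that `exc.output` is bytes under Python 3, so it has to be decoded before it is appended to a str.

```python
import subprocess
import sys

# Hypothetical stand-in for gtf_to_gene_index: prints an error and exits with status 1.
# The real tool is not invoked here.
FAILING_CMD = [
    sys.executable,
    "-c",
    "import sys; print('error: Duplicate Gene ID found in GTF: ATMG01275'); sys.exit(1)",
]


def format_gtf_error(cmd):
    """Run cmd and return an error message string, or None if it succeeds."""
    try:
        subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError as exc:
        # In Python 3, exc.output is bytes unless text mode was requested,
        # so "some str" + exc.output raises TypeError. Decode it first.
        output = exc.output
        if isinstance(output, bytes):
            output = output.decode("utf-8", errors="replace")
        return "Error detected in GTF file: " + output.strip()
    return None


if __name__ == "__main__":
    print(format_gtf_error(FAILING_CMD))
```

Passing `text=True` to `check_output` would be another way to get a str back; explicit decoding is used here only to make the bytes-to-str step visible.
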
pavsol commented 3 years ago

Thank you for a quick answer. Here it is:

$ ~/tools/cellranger-6.0.1/lib/bin/gtf_to_gene_index /home/pavsol/scRNAseq_pilot/arabidopsis/cellranger/arabidopsis_genes_lncRNA test.json
error: Duplicate Gene ID found in GTF: ATMG01275

So the issue seems to be a duplicated gene ID in my GTF. I will remove it and try again.

pavsol commented 3 years ago

$ >>> Reference successfully created! <<<

Simply removing the duplicated IDs solved the issue. Thank you for your help :)
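For anyone who wants to locate such IDs before running mkref, a rough sketch follows. It assumes that "duplicate" means the same gene_id appearing on more than one `gene` feature line of a tab-separated GTF; the actual check inside gtf_to_gene_index may differ. The function name `duplicated_gene_ids` is made up for this example.

```python
import re
from collections import Counter

# Matches the gene_id attribute in the ninth GTF column, e.g. gene_id "AT1G01010";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def duplicated_gene_ids(gtf_lines):
    """Return a sorted list of gene_ids found on more than one 'gene' feature line.

    Assumption: this definition of "duplicate" is a guess at what the
    validator checks, not taken from Cell Ranger itself.
    """
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # only 'gene' records are counted
        match = GENE_ID_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return sorted(gene_id for gene_id, n in counts.items() if n > 1)
```

Usage would look like `with open("my.filtered.gtf") as fh: print(duplicated_gene_ids(fh))`, where the file name is a placeholder.
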

evolvedmicrobe commented 3 years ago

Great! Glad to hear it was resolved. The next version of Cell Ranger will print the error message directly so that this does not happen again. Thank you for reporting this.

ericminikel commented 2 years ago

Apologies for commenting on a year-old closed thread, but this is still the second Google hit for "Error detected in GTF file:" so others may find this as well.

Even with Python 3, I am finding that the error message is blank, so it is impossible to figure out what is wrong. I am using an Ensembl reference, so the GTF should supposedly already be well-formatted.

# https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/tutorial_mr
# https://useast.ensembl.org/Macaca_fascicularis/Info/Index
wget http://ftp.ensembl.org/pub/release-107/fasta/macaca_fascicularis/dna/Macaca_fascicularis.Macaca_fascicularis_6.0.dna.toplevel.fa.gz
gunzip Macaca_fascicularis.Macaca_fascicularis_6.0.dna.toplevel.fa.gz
wget http://ftp.ensembl.org/pub/release-107/gtf/macaca_fascicularis/Macaca_fascicularis.Macaca_fascicularis_6.0.107.gtf.gz
gunzip Macaca_fascicularis.Macaca_fascicularis_6.0.107.gtf.gz
# use -la cellranger
use .cellranger-7.0.0
cellranger mkgtf \
  Macaca_fascicularis.Macaca_fascicularis_6.0.107.gtf Macaca_fascicularis.Macaca_fascicularis_6.0.107.filtered.gtf \
  --attribute=gene_biotype:protein_coding \
  --attribute=gene_biotype:lncRNA

use .python-3.9.2
cellranger mkref \
  --genome=Macaca_fascicularis_6.0 \
  --fasta=Macaca_fascicularis.Macaca_fascicularis_6.0.dna.toplevel.fa \
  --genes=Macaca_fascicularis.Macaca_fascicularis_6.0.107.filtered.gtf \
  --ref-version=1.0.0

Output:

Creating new reference folder at /broad/prions/ono/cyno/Macaca_fascicularis_6.0
...done

Writing genome FASTA file into reference folder...
...done

Indexing genome FASTA file...
...done

Writing genes GTF file into reference folder...
...done

mkref has failed: error building reference package
Error detected in GTF file:

Any ideas what to try next? I tried using an unfiltered version of the reference and it failed just the same. Thanks in advance.
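One thing that may be worth trying, following the suggestion earlier in this thread, is to run the validator binary directly and look at both output streams and the exit code. The sketch below is generic: `run_and_capture` is a hypothetical helper, and whether gtf_to_gene_index sits at the same lib/bin path in 7.0.0 as in 6.0.1 is an unverified assumption, so the command is taken from the command line rather than hard-coded.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (exit_code, stdout, stderr), both streams decoded to str."""
    # capture_output and text require Python 3.7 or newer.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


if __name__ == "__main__":
    # Usage: python run_and_capture.py <command> [args...]
    if len(sys.argv) < 2:
        sys.exit("usage: run_and_capture.py <command> [args...]")
    code, out, err = run_and_capture(sys.argv[1:])
    print("exit code:", code)
    print("stdout:", repr(out))
    print("stderr:", repr(err))
```

Printing the streams with `repr` makes an empty message distinguishable from one that contains only whitespace.
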