bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License
979 stars 355 forks source link

RefSeq GRCh38.p12 reference genome fails to build #2760

Closed mjsteinbaugh closed 4 years ago

mjsteinbaugh commented 5 years ago

I'm hitting some errors when attempting to index the current RefSeq Homo sapiens reference genome GCF_000001405.38_GRCh38.p12 using bcbio_setup_genome.py. Is RefSeq currently supported in bcbio?

Here's a link to the relevant files on the NCBI FTP server: ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12

Other genomes from Ensembl/GenBank and GENCODE (e.g. GRCh38 v29) install as expected on my virtual machine, so it appears that I'm running into some RefSeq-specific issues.

bowtie2, for example, errors out here, and I see this using either the GTF or GFF3 as input:

Total time for backward call to driver() for mirror index: 00:34:05
2019-04-05 16:14:23,509 - INFO - Committing changes: 3698000 features
2019-04-05 16:14:24,047 - INFO - Populating features table and first-order relations: 3698296 features
2019-04-05 16:14:24,047 - INFO - Creating relations(parent) index
2019-04-05 16:14:28,948 - INFO - Creating relations(child) index
2019-04-05 16:14:34,035 - INFO - Inferring transcript extents and writing to tempfile
2019-04-05 16:14:59,707 - INFO - Importing inferred features into db
2019-04-05 16:15:27,587 - INFO - Committing changes
2019-04-05 16:15:27,704 - INFO - Creating relations(parent) index
2019-04-05 16:15:32,459 - INFO - Creating relations(child) index
2019-04-05 16:15:37,302 - INFO - Creating features(featuretype) index
2019-04-05 16:15:40,852 - INFO - Creating features (seqid, start, end) index
2019-04-05 16:15:45,800 - INFO - Creating features (seqid, start, end, strand) index
2019-04-05 16:15:50,957 - INFO - Running ANALYSE features
invalid gffGroup detected on line: NC_000002.12 Curated Genomic stop_codon      88857361
        88857363        0.000000        -       0       gene_id "IGKC"; transcript_id "unknown_transcript_1";
GFF/GTF group unknown_transcript_1 on NC_000001.11-, this line is on NC_000002.12-, all group members must be on same seq and strand
1 errors
Creating gffutils database for /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gtf.
'transcript' or 'gene' entries not found, so inferring their extent. This can be very slow.
Writing out merged GTF file to /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gtf.
Traceback (most recent call last):
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 840, in <module>
    main(args.org_build, args.gtf, args.fasta, genome_dir, args.cores, args)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 298, in main
    gtf_to_refflat(gtf_file)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 439, in gtf_to_refflat
    genepred = gtf_to_genepred(gtf)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 431, in gtf_to_genepred
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/data00/bcbio/stable/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'gtfToGenePred -allErrors -ignoreGroupsWithoutExons -genePredExt /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gtf /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.genePred' returned non-zero exit status 255
Creating directories using /data00/bcbio/stable/genomes as the base.
Genomes will be installed into /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12.
Installed genome as /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/GCF_000001405.38_GRCh38.p12.fa.
Installed GTF as /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/rnaseq/ref-transcripts.gtf.
Creating the seq index.
[localhost] local: samtools faidx GCF_000001405.38_GRCh38.p12.fa
[localhost] local: picard -Xms500m -Xmx3500m CreateSequenceDictionary REFERENCE=/data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/GCF_000001405.38_GRCh38.p12.fa OUTPUT=/data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/GCF_000001405.38_GRCh38.p12.dict
Creating the bowtie2 index.
[localhost] local: mkdir /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/../bowtie2
[localhost] local: ln -sf ../seq/GCF_000001405.38_GRCh38.p12.fa /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/bowtie2/GCF_000001405.38_GRCh38.p12.fa
Traceback (most recent call last):
  File "/data00/bcbio/tools/bin/bcbio_setup_genome.py", line 312, in <module>
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/data00/bcbio/stable/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/data00/bcbio/stable/anaconda/bin/python /data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py --cores 28 --genome-dir /data00/bcbio/stable/genomes --gtf /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/rnaseq/ref-transcripts.gtf Hsapiens GCF_000001405.38_GRCh38.p12' returned non-zero exit status 1
roryk commented 5 years ago

Hi Mike,

It looks like it is throwing up on the stop_codon lines in the GTF file-- if you remove those and try to rerun, does it work? If so I can add stop_codon to the list of things we filter out from the GTF before going on with it.

mjsteinbaugh commented 5 years ago

Thanks, I'll try editing the GTF file to remove the stop_codon line and see if I can get it to index clean with bowtie2 directly first.

roryk commented 5 years ago

I think it is this that is failing: gtfToGenePred -allErrors -ignoreGroupsWithoutExons -genePredExt /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gtf /data00/bcbio/stable/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.genePred -- hopefully a clean version of the GTF file will work.

roryk commented 5 years ago

I'm guessing the start_codon entries will have the same problem so you probably want to remove those too.

mjsteinbaugh commented 5 years ago

Thanks yeah I'll see if I can make a stripped down GTF file that works with gtfToGenePred.

mjsteinbaugh commented 5 years ago

Here's a minimal example that fixes gtfToGenePred on my end:

wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.38_GRCh38.p12/GCF_000001405.38_GRCh38.p12_genomic.gtf.gz --output-document=ref-transcripts.gtf.gz
gunzip -c ref-transcripts.gtf.gz > ref-transcripts.gtf
grep -vwE "(start_codon|stop_codon|unknown_transcript_1)" ref-transcripts.gtf > ref-transcripts-sanitized.gtf
wc -l ref-transcripts*.gtf
# 3698302 ref-transcripts.gtf
# 3455709 ref-transcripts-sanitized.gtf
gtfToGenePred -allErrors -ignoreGroupsWithoutExons -genePredExt ref-transcripts-sanitized.gtf ref-transcripts.genePred
wc -l ref-transcripts.genePred
# 172088 ref-transcripts.genePred
mjsteinbaugh commented 5 years ago

Using the sanitized GTF that works directly with gtfToGenePred, now here's what I'm seeing with bcbio 1.1.4. Something is going on during the processing that is adding a trailing "" that isn't present in the original file.

2019-04-08 15:15:48,162 - INFO - Running ANALYSE features
Unpaired type("")/val on end of gtf line 1 of /data00/bcbio/patched/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref$
transcripts.gtf
Creating gffutils database for /data00/bcbio/patched/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gt$
.
'transcript' or 'gene' entries not found, so inferring their extent. This can be very slow.
Writing out merged GTF file to /data00/bcbio/patched/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gt$
.
Traceback (most recent call last):
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 841, in <module>
    main(args.org_build, args.gtf, args.fasta, genome_dir, args.cores, args)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 299, in main
    gtf_to_refflat(gtf_file)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 440, in gtf_to_refflat
    genepred = gtf_to_genepred(gtf)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 432, in gtf_to_genepred
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/data00/bcbio/patched/anaconda/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'gtfToGenePred -allErrors -ignoreGroupsWithoutExons -genePredExt /data00/bcbio/patch$
d/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/tmpcbl/ref-transcripts.gtf /data00/bcbio/patched/genomes/Hsapiens/GCF_0000014
05.38_GRCh38.p12/tmpcbl/ref-transcripts.genePred' returned non-zero exit status 255.
Creating directories using /data00/bcbio/patched/genomes as the base.
Genomes will be installed into /data00/bcbio/patched/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12.
Installed genome as /data00/bcbio/patched/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/seq/GCF_000001405.38_GRCh38.p12.fa.
Installed GTF as /data00/bcbio/patched/genomes/Hsapiens/GCF_000001405.38_GRCh38.p12/rnaseq/ref-transcripts.gtf.
Creating the seq index.
Creating the bowtie2 index.
Traceback (most recent call last):
  File "/data00/bcbio/tools-patched/bin/bcbio_setup_genome.py", line 308, in <module>
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/data00/bcbio/patched/anaconda/lib/python3.6/subprocess.py", line 291, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/data00/bcbio/patched/anaconda/bin/python /data00/biodata/cloudbiolinux/utils/prepar
e_tx_gff.py --cores 28 --genome-dir /data00/bcbio/patched/genomes --gtf /data00/bcbio/patched/genomes/Hsapiens/GCF_000001405
.38_GRCh38.p12/rnaseq/ref-transcripts.gtf Hsapiens GCF_000001405.38_GRCh38.p12' returned non-zero exit status 1.

Here's the original RefSeq GTF, for example:

NC_000001.11    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";

Here's the first line of tmpcbl/ref-transcripts.gtf:

NC_000001.11    BestRefSeq      gene    11874   14409   .       +       .       gene_id "DDX11L1"; db_xref "GeneID:100287102"; db_xref "HGNC:HGNC:37102"; description "DEAD/H-box helicase 11 like 1"; gbkey "Gene"; gene "DDX11L1"; gene_biotype "transcribed_pseudogene"; pseudo "true";  ""

For reference, Ensembl GTF files (e.g. Homo_sapiens.GRCh38.96.gtf.gz), also look like this, with no trailing "".

1       havana  gene    11869   14409   .       +       .       gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
roryk commented 5 years ago

Can you send along the top of the ref-transcripts.gtf, the original gtf and the one that works from Ensembl? Like one gene from each. Like in files, not pasted in. I bet there is some trailing whitespace in the RefSeq GTF, but it would be good to test on something small.

roryk commented 5 years ago

That's totally it, the RefSeq GTF has trailing whitespace. If you trim that off during sanitizing it should fix that issue.

Is there a reason you're switching it up? There's a bunch of stuff in bcbio that is really useful that won't work with the RefSeq version of the genome.

gabeng commented 5 years ago

Hi Rory,

I have no idea what is special about the RefSeq flavor of hg38. However, I was considering to try out the newest patchlevel of hg38. I noticed that bcbio is using a 1kG version of hg38 from 2016 without the patches released since then (if I read the provenance files correctly). Do you have an idea if the patches provide any benefit in variant calling performance or any benefit at all?

TIA for your feedback.

mjsteinbaugh commented 5 years ago

@roryk Will do, I'll zip up these files and send you a link on O2. I need to use RefSeq for a multi-omics analysis using some proprietary aligned data that can't be reprocessed from the original FASTQs easily. Otherwise, I'm currently using the bcbio hg38 reference (GenBank GCA_000001405.15).

mjsteinbaugh commented 5 years ago

@gabeng That's correct that bcbio hg38 is sourced from 1000 Genomes. Here's the recipe YAML from cloudbiolinux with more details: https://github.com/chapmanb/cloudbiolinux/blob/master/ggd-recipes/hg38/seq.yaml

Genome indices built from RefSeq (GCF_*) and GenBank (GCA_*; i.e. Ensembl, GENCODE) assemblies will have significant differences in the number of annotated genes and transcripts, so it's important to use a single reference genome per project.

The RefSeq FAQ has a nice explanation:

What is the difference between RefSeq and GenBank?

The GenBank archival sequence database includes publicly available DNA sequences submitted from individual laboratories and large-scale sequencing projects. GenBank is part of the International Nucleotide Sequence Database Collaboration (INSDC) along with the European Nucleotide Archive and the DNA Data Bank of Japan (DDBJ). Submitted sequence data is exchanged daily between the three collaborators to achieve comprehensive worldwide coverage. As an archival database, GenBank can be very redundant for some loci. GenBank sequence records are owned by the original submitter and cannot be altered by a third party.

RefSeq sequences are not part of the INSDC but are derived from INSDC sequences to provide non-redundant curated data representing our current knowledge of known genes. Some records include sequence information gathered from more than one INSDC record. Records may include sequence, descriptive information, publications, or feature annotation that is not available from any single INSDC record. RefSeq records are owned by NCBI and therefore can be updated as needed to maintain current annotation or to incorporate additional information. Also see the appendix provided in the NCBI Handbook, GenBank chapter.

Another distinction is that transcripts and proteins annotated on RefSeq genomic records are instantiated as separate records; in contrast, GenBank only instantiates the proteins annotated on genomic sequence records.

See also this post on Quora.

roryk commented 5 years ago

Thanks, @mjsteinbaugh, no need to pass them on, I think if you just add to your GTF cleaning something to strip off trailing spaces it should fix the issue you were seeing.

mjsteinbaugh commented 5 years ago

For reference, this seems to do the trick for removing trailing whitespace on the RefSeq GTF:

sed -i 's/[[:blank:]]*$//' ref-transcripts.gtf
mjsteinbaugh commented 5 years ago

Okay, I can get bcbio to build bowtie2, minimap2, and STAR indices. However, I'm hitting an error with picard:

[localhost] local: picard -Xms500m -Xmx3500m CreateSequenceDictionary REFERENCE=/data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.fa OUTPUT=/data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.dict
Traceback (most recent call last):
  File "/data00/bcbio/tools/bin/bcbio_setup_genome.py", line 312, in <module>
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/data00/bcbio/stable/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/data00/bcbio/stable/anaconda/bin/python /data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py --cores 30 --genome-dir /data00/bcbio/stable/genomes --gtf /data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/rnaseq/ref-transcripts.gtf Hsapiens GRCh38_RefSeq_93' returned non-zero exit status 1
roryk commented 5 years ago

If you run picard -Xms500m -Xmx3500m CreateSequenceDictionary REFERENCE=/data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.fa OUTPUT=/data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.dict manually, does it give you more information about the error?

mjsteinbaugh commented 5 years ago

I'm running that right now actually

mjsteinbaugh commented 5 years ago

Weird, this works for me:

rm /data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.dict
picard -Xms500m -Xmx3500m CreateSequenceDictionary REFERENCE=/data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.fa OUTPUT=/data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.dict
mjsteinbaugh commented 5 years ago

Ah okay it's this step that's erroring out:

14:00:15.487 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/data00/bcbio/stable/anaconda/share/picard-2.18.26-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Tue Apr 09 14:00:15 EDT 2019] CreateSequenceDictionary OUTPUT=/data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.dict REFERENCE=/data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/seq/GRCh38_RefSeq_93.fa    TRUNCATE_NAMES_AT_WHITESPACE=true NUM_SEQUENCES=2147483647 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Tue Apr 09 14:00:15 EDT 2019] Executing as mike@azlabapp35 on Linux 3.10.0-957.10.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_192-b01; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.26-SNAPSHOT
[Tue Apr 09 14:00:32 EDT 2019] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.29 minutes.
Runtime.totalMemory()=595591168
2019-04-09 14:13:25,968 - INFO - Committing changes: 3455000 features
2019-04-09 14:13:26,592 - INFO - Populating features table and first-order relations: 3455703 features
2019-04-09 14:13:26,593 - INFO - Creating relations(parent) index
2019-04-09 14:13:31,792 - INFO - Creating relations(child) index
2019-04-09 14:13:36,443 - INFO - Inferring transcript extents and writing to tempfile
2019-04-09 14:14:03,806 - INFO - Importing inferred features into db
2019-04-09 14:14:32,978 - INFO - Committing changes
2019-04-09 14:14:33,079 - INFO - Creating relations(parent) index
2019-04-09 14:14:38,089 - INFO - Creating relations(child) index
2019-04-09 14:14:42,692 - INFO - Creating features(featuretype) index
2019-04-09 14:14:46,100 - INFO - Creating features (seqid, start, end) index
2019-04-09 14:14:51,403 - INFO - Creating features (seqid, start, end, strand) index
2019-04-09 14:14:56,943 - INFO - Running ANALYSE features
Traceback (most recent call last):
  File "/data00/bcbio/stable/anaconda/lib/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py", line 51, in <module>
    for f in HTSeq.GFF_Reader( gtf_file ):
  File "/data00/bcbio/stable/anaconda/lib/python2.7/site-packages/HTSeq/__init__.py", line 208, in __iter__
    ( attr, name ) = parse_GFF_attribute_string( attributeStr, True )
  File "/data00/bcbio/stable/anaconda/lib/python2.7/site-packages/HTSeq/__init__.py", line 161, in parse_GFF_attribute_string
    raise ValueError, "The attribute string seems to contain mismatched quotes."
ValueError: The attribute string seems to contain mismatched quotes.
Creating gffutils database for /data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/tmpcbl/ref-transcripts.gtf.
'transcript' or 'gene' entries not found, so inferring their extent. This can be very slow.
Writing out merged GTF file to /data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/tmpcbl/ref-transcripts.gtf.
Traceback (most recent call last):
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 840, in <module>
    main(args.org_build, args.gtf, args.fasta, genome_dir, args.cores, args)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 301, in main
    prepare_dexseq(gtf_file)
  File "/data00/biodata/cloudbiolinux/utils/prepare_tx_gff.py", line 811, in prepare_dexseq
    subprocess.check_call(cmd.format(**locals()), shell=True)
  File "/data00/bcbio/stable/anaconda/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/data00/bcbio/stable/anaconda/bin/python /data00/bcbio/stable/anaconda/lib/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py /data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/tmpcbl/ref-transcripts.gtf /data00/bcbio/stable/genomes/Hsapiens/GRCh38_RefSeq_93/tmpcbl/ref-transcripts.dexseq.gff3' returned non-zero exit status 1
roryk commented 5 years ago

It looks like the GTF file is malformed again, I bet if you look at it you'll see where the issue is.

mjsteinbaugh commented 5 years ago

Yeah the tmp file tmpcbl/ref-transcripts.gtf getting passed to dexseq_prepare_annotation.py is malformed. There's some step during the prepare_tx_gff.py call involving picard and gffutils that corrupts the original GTF input file.

Loading this up in R with rtracklayer confirms the embedded double-quote issue, although I'm not exactly sure where.

x <- rtracklayer::import("malformed.gtf")
Warning in readGFF(filepath, version = version, filter = filter) :
  the value part of some of the tag value pairs contains embedded double-quotes
Calls: <Anonymous> ... import -> import -> .local -> readGFFAsGRanges -> readGFF

Passing in the original GTF file in used in the bcbio_setup_genome.py call works though:

/data00/bcbio/stable/anaconda/bin/python /data00/bcbio/stable/anaconda/lib/R/library/DEXSeq/python_scripts/dexseq_prepare_annotation.py /home/mike/refseq-trouble/GCF_000001405.38_GRCh38.p12_genomic-clean.gtf ref-transcripts.dexseq.gff3
roryk commented 5 years ago

I'd look at the malformed GTF and see if you can figure out where the issue is; it might be the original GTF file needs to get fixed some more.

roryk commented 4 years ago

Thanks, did you get this solved?

mjsteinbaugh commented 4 years ago

@roryk I punted on using RefSeq for this project and we've just been using Ensembl / GENCODE annotations here

roryk commented 4 years ago

Ok, thanks! Closing and we can re-visit if there is need for this particular build.

amizeranschi commented 4 years ago

For the record, I've also been having trouble while running bcbio_setup_genome.py on a RefSeq GTF file. I'm guessing that the general RefSeq pipeline setup renders their GTF files unusable for bcbio by default, so I'm describing below what worked for me in case it helps anyone else.

I've also had to remove trailing spaces and any lines containing transcript_id "unknown_transcript_1", as described above. Then, in order to fix the error The attribute string seems to contain mismatched quotes (https://github.com/bcbio/bcbio-nextgen/issues/2760#issuecomment-481375203), I used the following perl command to replace any semicolons with & signs when enclosed within quotes inside the GTF annotations (as suggested here):

perl -e 'open(FILE, "original_file.gtf"); while(<FILE>){$_ =~ s/([^\"]);/\1\&/g; print $_;}' > edited_file.gtf

This got the GTF file usable with bcbio_setup_genome.py.

mjsteinbaugh commented 4 years ago

Thanks @amizeranschi, I'll try to circle back to this and see if it works on my end. bcbio currently uses gffutils internally to process GTF/GFF3 files. Maybe we can just add this step manually for RefSeq genome files prior to handing off to gffutils downstream in the genome build steps.

amizeranschi commented 4 years ago

Good idea, it would be useful to have RefSeq GTF files automatically sanitized somehow.

Alternatively, this could be done directly from bcbio_setup_genome.py when the user mentions refseq in the --buildversion option, which was recently added (https://github.com/bcbio/bcbio-nextgen/commit/38badc09e1c0d2168c8b6105c21ff7726259ce91).

roryk commented 4 years ago

Thanks for the information, it is helpful to have-- unfortunately it's pretty tough to handle all GTF files since there is no actual specification for them, so I think it's best to just compile the workarounds necessary to get various ones to work, rather than implement fixes for all of the different types.