COMBINE-lab / simpleaf

A rust framework to make using alevin-fry even simpler

Issues with indexing concatenated gtf file #98

Closed (wmacnair closed this issue 1 year ago)

wmacnair commented 1 year ago

Hello :)

I am trying to map to a combined human and EBV genome, and I'm running into issues... Hopefully this problem is both (a) kind of within scope of what you do / what I can reasonably ask here, and (b) in the appropriate repo. Please tell me if either is off!

What I have done:

I first concatenated them naively

simpleaf --version
# simpleaf 0.14.0

# trim gtf headers
grep -vE "^#+.+" ebv.gtf > ebv_clean.gtf
grep -vE "^#+.+" human_head.gtf > human_clean.gtf

# concatenate
cat ebv.fasta human.fasta > human_ebv.fasta
cat ebv_clean.gtf human_clean.gtf > human_ebv.gtf
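
(As a quick sanity check, just a sketch using the file names above, one can confirm that the sequence names in the concatenated FASTA and the seqnames in the concatenated GTF agree:)

# compare sequence names between the concatenated FASTA and GTF
grep "^>" human_ebv.fasta | sed 's/^>//; s/ .*//' | sort -u > fasta_names.txt
grep -v "^#" human_ebv.gtf | cut -f1 | sort -u > gtf_names.txt
# seqnames that appear in the GTF but not in the FASTA (should print nothing)
comm -13 fasta_names.txt gtf_names.txt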

then ran simpleaf index:

simpleaf index \
  --output /Users/macnairw/work/packages/scProcess/data/alevin_fry_home/ebv_splici \
  --fasta /Users/macnairw/work/packages/scProcess/data/reference_genome/refdata-ebv/human_ebv.fasta \
  --gtf /Users/macnairw/work/packages/scProcess/data/reference_genome/refdata-ebv/human_ebv.gtf \
  --rlen 91 \
  --threads 4 \
  --use-piscem

but got errors:

2023-07-13T11:44:17.547973Z  INFO simpleaf::simpleaf_commands::indexing: preparing to make reference with roers
Error: invalid record

Caused by:
    0: invalid attributes
    1: invalid entry
    2: empty input
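
(One way to see how the two GTFs differ, just a sketch using the file names above, is to list the attribute keys used in each file; the key is the first word of each ';'-separated attribute in column 9:)

# list the attribute keys used in each GTF (column 9 holds the attributes)
for f in ebv.gtf human_head.gtf; do
  echo "== $f"
  grep -v "^#" "$f" | cut -f9 | tr ';' '\n' | sed 's/^ *//' | awk 'NF {print $1}' | sort | uniq -c
done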

The gtf files have different attribute lists, so I then tried removing everything except gene_id:

# trim gtf headers, remove everything after gene_id
grep -vE "^#+.+" ebv.gtf | awk -F';' '{print $1";"}' > ebv_clean.gtf
grep -vE "^#+.+" human_head.gtf | awk -F';' '{print $1";"}' > human_clean.gtf
cat ebv_clean.gtf human_clean.gtf > human_ebv.gtf

# check
head human_ebv.gtf
# NC_007605.1     RefSeq  gene    1691    5856    .       +       .       gene_id "HHV4_BNRF1";
# NC_007605.1     RefSeq  CDS     1736    5689    .       +       0       gene_id "HHV4_BNRF1";
# NC_007605.1     RefSeq  start_codon     1736    1738    .       +       0       gene_id "HHV4_BNRF1";
# NC_007605.1     RefSeq  stop_codon      5690    5692    .       +       0       gene_id "HHV4_BNRF1";
# NC_007605.1     RefSeq  transcript      6629    6795    .       +       .       gene_id "unassigned_gene_1";
# NC_007605.1     RefSeq  exon    6629    6795    .       +       .       gene_id "unassigned_gene_1";
# NC_007605.1     RefSeq  transcript      6956    7128    .       +       .       gene_id "unassigned_gene_2";
# NC_007605.1     RefSeq  exon    6956    7128    .       +       .       gene_id "unassigned_gene_2";
# NC_007605.1     RefSeq  gene    9631    10262   .       +       .       gene_id "HHV4_BCRF1.1";
# NC_007605.1     RefSeq  CDS     9675    10184   .       +       0       gene_id "HHV4_BCRF1.1";

# run simpleaf
simpleaf index \
  --output /Users/macnairw/work/packages/scProcess/data/alevin_fry_home/ebv_splici \
  --fasta /Users/macnairw/work/packages/scProcess/data/reference_genome/refdata-ebv/human_ebv.fasta \
  --gtf /Users/macnairw/work/packages/scProcess/data/reference_genome/refdata-ebv/human_ebv.gtf \
  --rlen 91 \
  --threads 4 \
  --use-piscem

and, yay, new error message!

2023-07-13T12:02:06.658749Z  INFO simpleaf::simpleaf_commands::indexing: preparing to make reference with roers
2023-07-13T12:02:06.660394Z  INFO grangers::grangers::reader::gtf: Finished parsing the input file. Found 0 comments and 881 records.
2023-07-13T12:02:06.660900Z  INFO roers: Built the Grangers object for 881 records
Error: Found exon features with a null transcript_id; Cannot proceed

I then tried keeping just gene_id and transcript_id, which I think almost got this to work. However, transcript_id is completely missing for some entries in the human gtf, so I got the same error message.
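
(To quantify that, here is a rough check of which feature types in the human gtf lack a transcript_id attribute, using the file name from above:)

# feature types in the human GTF whose attribute column has no transcript_id
grep -v "^#" human_head.gtf | awk -F'\t' '$9 !~ /transcript_id/ {print $3}' | sort | uniq -c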

There are a few things I'm not sure about here. Any thoughts? I've included the ebv.gtf and the first 1k lines of the human.gtf in the zip file attached.

Please do gently suggest I go somewhere else if this is out of scope :)

Cheers Will

human_ebv_gtf_files.zip

rob-p commented 1 year ago

Thanks for the detailed report, Will! We'll look into this ASAP (cc @DongzeHE).

Off the top of my head, there is no obvious reason it shouldn't work with properly formatted FASTA and GTF files. As a start to the investigation, I tried to process it with pyroe (as you may know, because pyranges keeps breaking, we recently moved from pyroe to our in-house roers tool for constructing references). One area where pyroe is still better than roers is error reporting. Running the combined files through pyroe, I get:

WARNING:root: Found records with missing gene_id/gene_name field. These records are reported in test_human_ebv_splici_pyroe/missing_gene_id_or_name_records.gtf. Imputed 681 missing gene_name using gene_id.
WARNING:root: Found transcripts without corresponding transcript feature record; Those transcripts were not used to extract unspliced sequences.
WARNING:root: Found genes without corresponding gene feature record. Those genes were not used to extract unspliced sequences.
WARNING:root: Found transcripts whose boundaries defined in their transcript feature record do not match their exons' bounds. However, those boundaries were still used to extract unspliced sequences.
WARNING:root: Found genes whose boundaries defined in the gene feature records do not equal to their exons' bounds. However, those boundaries were still used to extract unspliced sequences.
WARNING:root: A clean GTF file with all issues fixed is generated at test_human_ebv_splici_pyroe/clean_gtf.gtf. If needed, please rerun using this clean GTF file.
Traceback (most recent call last):
  File "/Users/rob/miniconda3/bin/pyroe", line 271, in <module>
    no_flanking_merge=args.no_flanking_merge,
  File "/Users/rob/miniconda3/lib/python3.7/site-packages/pyroe/make_txome.py", line 638, in make_splici_txome
    introns = gr.features.introns(by="transcript")
  File "/Users/rob/miniconda3/lib/python3.7/site-packages/pyranges/genomicfeatures.py", line 254, in introns
    result = pyrange_apply(_introns2, by_gr, exons, **kwargs)
  File "/Users/rob/miniconda3/lib/python3.7/site-packages/pyranges/multithreaded.py", line 293, in pyrange_apply
    result = call_f(function, nparams, df, odf, kwargs)
  File "/Users/rob/miniconda3/lib/python3.7/site-packages/pyranges/multithreaded.py", line 23, in call_f
    return f.remote(df, odf, **kwargs)
  File "/Users/rob/miniconda3/lib/python3.7/site-packages/pyranges/genomicfeatures.py", line 554, in _introns2
    id_column=id_column)
AssertionError: The transcript_ids need to be unique to compute the introns.

So it seems there are several things pyroe is unhappy about here ;P. The "clean" GTF is attached here (I subset it to only human chr17-19 to make it faster/smaller), though I haven't yet tested whether it can be parsed. I have a meeting now but will look into this more later today.
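
(If you want to chase the duplicate transcript_id assertion directly, something along these lines should list the repeated IDs among the transcript records; this is just a sketch against the concatenated GTF, I haven't run it on your files:)

# transcript_id values that occur on more than one "transcript" feature record
grep -v "^#" human_ebv.gtf \
  | awk -F'\t' '$3 == "transcript" { match($9, /transcript_id "[^"]+"/); print substr($9, RSTART + 15, RLENGTH - 16) }' \
  | sort | uniq -d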

Best, Rob

clean_gtf.gtf.zip

rob-p commented 1 year ago

Ok, small update. With the clean_gtf, roers did run to completion, with a few warnings:

roers make-ref -a i data/test_human_ebv.fa test_human_ebv_splici_pyroe/clean_gtf.gtf test_human_ebv_splici
2023-07-13T17:45:12.858227Z  INFO grangers::reader::gtf: Finished parsing the input file. Found 0 comments and 207838 records.
2023-07-13T17:45:13.009489Z  INFO roers: Built the Grangers object for 207838 records
2023-07-13T17:45:13.138283Z  WARN grangers::grangers_info: The exon_number column contains null values. Will compute the exon number from exon start position .
2023-07-13T17:45:13.306708Z  INFO roers: Proceed 174519 exon records from 28503 transcripts
2023-07-13T17:45:21.367358Z  INFO roers: Processing 146016 intronic records
2023-07-13T17:45:34.285890Z  INFO roers: Done!

So, somehow there are several issues with the input (concatenated) GTF that pyroe was able to fix. I will now try the two genomes separately (I know the human one already works, so I will focus on the EBV).
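
(For the separate test, the invocation follows the same pattern as above, just pointed at the EBV files; the output directory name here is only a placeholder:)

roers make-ref -a i ebv.fasta ebv_clean.gtf test_ebv_splici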

rob-p commented 1 year ago

Ok @wmacnair,

Progress. This is kind of hilarious ;). The problem is with the EBV GTF file. In addition to the warnings that you see above (which are structural but maybe not serious?), the major problem is that the GTF file is ill-formed. Each line ends with trailing whitespace after the final ";". This is technically not allowed. While some parsers are "flexible" and allow this malformation, the noodles (https://github.com/zaeleus/noodles) library we use in grangers is not. I trimmed the trailing whitespace from all lines and then concatenated the GTFs together and it worked. I trimmed with:

cat human_ebv.gtf | awk '{ sub(/[ \t]+$/, ""); print }' > human_ebv_trimmed.gtf

and then concatenated as above. Let me know if this resolves things for you.
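
(To confirm the trimming worked, counting the lines that still end in a space or tab should print 0:)

# count lines still ending in whitespace; expect 0 after trimming
grep -cE '[[:space:]]$' human_ebv_trimmed.gtf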

Best, Rob

wmacnair commented 1 year ago

Hey, thanks for looking through all of this in time for my morning :)

So it's just a dumb formatting bug in the EBV gtf file, with an extra space or tab at the end that shouldn't be there? A nice easy fix, at least!

With this fix, it ran fine 🎉

2023-07-14T07:44:32.678297Z  INFO simpleaf::simpleaf_commands::indexing: preparing to make reference with roers
2023-07-14T07:44:54.313831Z  INFO grangers::grangers::reader::gtf: Finished parsing the input file. Found 0 comments and 2766650 records.
2023-07-14T07:44:56.406021Z  INFO roers: Built the Grangers object for 2766650 records
2023-07-14T07:44:57.649529Z  WARN roers: Found missing gene_id and/or gene_name; Imputing. If both missing, will impute using transcript_id; Otherwise, will impute using the existing one.
2023-07-14T07:44:57.744350Z  INFO roers: Proceed 1305563 exon records from 199197 transcripts
2023-07-14T07:45:04.041891Z  INFO roers: Processing 1106366 intronic records
2023-07-14T07:45:04.631998Z  INFO roers: Done!
2023-07-14T07:45:04.693867Z  INFO simpleaf::simpleaf_commands::indexing: piscem build cmd : /Users/macnairw/opt/anaconda3/envs/af/bin/piscem build -k 31 -m 19 -o /Users/macnairw/work/packages/scProcess/data/alevin_fry_home/ebv_splici/index/piscem_idx -s /Users/macnairw/work/packages/scProcess/data/alevin_fry_home/ebv_splici/ref/roers_ref.fa --threads 4
2023-07-14T07:45:12.002763Z  INFO simpleaf::utils::prog_utils: command returned successfully (exit status: 0)

Thanks so much! Will