cellranger-arc mkref error with ensembl gtf

rtyags commented 2 years ago

Hi,

I have been trying to run cellranger-arc mkref using hg19.ensGene.gtf (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ensGene.gtf.gz) and running into an error that says "records for gene_id ENSG00000238009 are not contiguous in the file". This is probably because the gtf is sorted by location, and if a gene overlaps another, then the gene records aren't going to be contiguous in the gtf.

Is there a solution to this? How do people run mkref with ensembl gtf, since presumably this issue will always be present in those gtfs. is it possible for cellranger to include a utility to pre-process gtfs, so that it doesn't throw this error?

Thanks

danliangunc commented 2 years ago

DId you solve the problem? I have the same error.

chbk commented 2 years ago

The error comes from cellranger-arc-2.0.0/lib/python/atac_rna/reference.py.

If anyone is interested, I made a python script to sort the gtf file for cellranger. It requires pandas and natsort.

./fixgtf.py -i input.gtf -o output.sorted.gtf

#!/usr/bin/env python3

import argparse
import csv
import natsort
import pandas

# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
  '-i',
  dest = 'input',
  required = True,
  help = 'Input gtf'
)
parser.add_argument(
  '-o',
  dest = 'output',
  required = True,
  help = 'Output sorted gtf'
)
args = parser.parse_args()

# Read gtf input
gtf_columns = {
  'chromosome': 'str',
  'source': 'str',
  'feature': 'str',
  'start': 'uint64',
  'end': 'uint64',
  'score': 'str',
  'strand': 'str',
  'frame': 'str',
  'attribute': 'str'
}

gtf = pandas.read_csv(
  args.input,
  sep = '\t',
  comment = '#',
  names = gtf_columns.keys(),
  dtype = gtf_columns
)

# Get gene id and transcript id for each row
gtf['gene'] = gtf['attribute'].str.extract(r'gene_id "(.+?)"')
gtf['transcript'] = gtf['attribute'].str.extract(r'transcript_id "(.+?)"')

# Get genes start and end positions
genes = gtf[gtf['feature'] == 'gene'][
  ['chromosome', 'strand', 'gene', 'start', 'end']
].rename(
  columns = {
    'start': 'gene_start',
    'end': 'gene_end'
  }
).set_index(
  ['chromosome', 'strand', 'gene']
)

# Get transcripts start and end positions
transcripts = gtf[gtf['feature'] == 'transcript'][
  ['chromosome', 'strand', 'transcript', 'start', 'end']
].rename(
  columns = {
    'start': 'transcript_start',
    'end': 'transcript_end',
  }
).set_index(
  ['chromosome', 'strand', 'transcript']
)

# Add gene and transcript start and end positions to each row
gtf = gtf.set_index(
  ['chromosome', 'strand', 'gene']
).merge(
  genes,
  how = 'left',
  on = ['chromosome', 'strand', 'gene']
).reset_index().set_index(
  ['chromosome', 'strand', 'transcript']
).merge(
  transcripts,
  how = 'left',
  on = ['chromosome', 'strand', 'transcript']
).reset_index()

# Sort rows
gtf = gtf.sort_values(
  by = [
    'chromosome',
    'strand',
    'gene_start',
    'gene_end',
    'gene',
    'transcript_start',
    'transcript_end',
    'transcript',
    'feature',
    'start',
    'end'
  ],
  key = lambda x: (
    [0 if i == 'gene' else 1 if i == 'transcript' else 2 for i in x]
    if x.name == 'feature'
    else natsort.natsort_key(x)
  )
)

# Write gtf to output
gtf.to_csv(
  args.output,
  sep = '\t',
  columns = gtf_columns.keys(),
  header = False,
  index = False,
  quoting = csv.QUOTE_NONE,
  float_format = '%.10g'
)

brettolsen commented 2 years ago

This issue has been resolved in the most recent 2.0.1 release of cellranger-arc, available for download here: https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/downloads/latest

smahaffey commented 2 years ago

The issue does not seem to be resolved. I'm using 2.0.1 and I'm getting this error on an Rn7 GTF RefSeq file downloaded from UCSC. I will try @chbk 's python script to sort it first.

seburgess commented 2 years ago

thank you chbk the perl script worked beautifully. Still had the problem with cellranger-arc 2.0.1

mossconfuse commented 1 year ago

I experienced the same issue with using cellranger-arc mkref 2.0.1.

I started with a gff3 file that I downloaded from ncbi. When I convert it to gtf with agat_convert_sp_gff2gtf.pl, mkref runs to completion. When I convert it with gffread -T -o, I get the "records for gene_id X are not contiguous in the file" error. When I look at rows with gene X using grep, the difference is additional rows for the agat gtf with utr values in the 3rd column. Otherwise the order of rows for the gene appear the same. When I run @chbk 's script on my gffread file, it resolves the issue. Thanks @chbk.

10XGenomics / cellranger

cellranger-arc mkref error with ensembl gtf #133