Closed rtyags closed 8 months ago
DId you solve the problem? I have the same error.
The error comes from cellranger-arc-2.0.0/lib/python/atac_rna/reference.py
.
If anyone is interested, I made a python script to sort the gtf file for cellranger. It requires pandas and natsort.
./fixgtf.py -i input.gtf -o output.sorted.gtf
#!/usr/bin/env python3
import argparse
import csv
import natsort
import pandas
# Parse arguments
parser = argparse.ArgumentParser()
parser.add_argument(
'-i',
dest = 'input',
required = True,
help = 'Input gtf'
)
parser.add_argument(
'-o',
dest = 'output',
required = True,
help = 'Output sorted gtf'
)
args = parser.parse_args()
# Read gtf input
gtf_columns = {
'chromosome': 'str',
'source': 'str',
'feature': 'str',
'start': 'uint64',
'end': 'uint64',
'score': 'str',
'strand': 'str',
'frame': 'str',
'attribute': 'str'
}
gtf = pandas.read_csv(
args.input,
sep = '\t',
comment = '#',
names = gtf_columns.keys(),
dtype = gtf_columns
)
# Get gene id and transcript id for each row
gtf['gene'] = gtf['attribute'].str.extract(r'gene_id "(.+?)"')
gtf['transcript'] = gtf['attribute'].str.extract(r'transcript_id "(.+?)"')
# Get genes start and end positions
genes = gtf[gtf['feature'] == 'gene'][
['chromosome', 'strand', 'gene', 'start', 'end']
].rename(
columns = {
'start': 'gene_start',
'end': 'gene_end'
}
).set_index(
['chromosome', 'strand', 'gene']
)
# Get transcripts start and end positions
transcripts = gtf[gtf['feature'] == 'transcript'][
['chromosome', 'strand', 'transcript', 'start', 'end']
].rename(
columns = {
'start': 'transcript_start',
'end': 'transcript_end',
}
).set_index(
['chromosome', 'strand', 'transcript']
)
# Add gene and transcript start and end positions to each row
gtf = gtf.set_index(
['chromosome', 'strand', 'gene']
).merge(
genes,
how = 'left',
on = ['chromosome', 'strand', 'gene']
).reset_index().set_index(
['chromosome', 'strand', 'transcript']
).merge(
transcripts,
how = 'left',
on = ['chromosome', 'strand', 'transcript']
).reset_index()
# Sort rows
gtf = gtf.sort_values(
by = [
'chromosome',
'strand',
'gene_start',
'gene_end',
'gene',
'transcript_start',
'transcript_end',
'transcript',
'feature',
'start',
'end'
],
key = lambda x: (
[0 if i == 'gene' else 1 if i == 'transcript' else 2 for i in x]
if x.name == 'feature'
else natsort.natsort_key(x)
)
)
# Write gtf to output
gtf.to_csv(
args.output,
sep = '\t',
columns = gtf_columns.keys(),
header = False,
index = False,
quoting = csv.QUOTE_NONE,
float_format = '%.10g'
)
This issue has been resolved in the most recent 2.0.1 release of cellranger-arc, available for download here: https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/downloads/latest
The issue does not seem to be resolved. I'm using 2.0.1 and I'm getting this error on an Rn7 GTF RefSeq file downloaded from UCSC. I will try @chbk 's python script to sort it first.
thank you chbk the perl script worked beautifully. Still had the problem with cellranger-arc 2.0.1
I experienced the same issue with using cellranger-arc mkref 2.0.1.
I started with a gff3 file that I downloaded from ncbi. When I convert it to gtf with agat_convert_sp_gff2gtf.pl, mkref runs to completion. When I convert it with gffread -T -o, I get the "records for gene_id X are not contiguous in the file" error. When I look at rows with gene X using grep, the difference is additional rows for the agat gtf with utr values in the 3rd column. Otherwise the order of rows for the gene appear the same. When I run @chbk 's script on my gffread file, it resolves the issue. Thanks @chbk.
Hi,
I have been trying to run cellranger-arc mkref using hg19.ensGene.gtf (https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ensGene.gtf.gz) and running into an error that says "records for gene_id ENSG00000238009 are not contiguous in the file". This is probably because the gtf is sorted by location, and if a gene overlaps another, then the gene records aren't going to be contiguous in the gtf.
Is there a solution to this? How do people run mkref with ensembl gtf, since presumably this issue will always be present in those gtfs. is it possible for cellranger to include a utility to pre-process gtfs, so that it doesn't throw this error?
Thanks