LuChenLab / SCAPE

Single Cell Alternative Polyadenylation using Expectation-maximization
MIT License
6 stars 2 forks source link

Error preparing #6

Closed sunrain1234 closed 2 years ago

sunrain1234 commented 2 years ago

Hello,

I'm trying to prepare my annotation downloaded from CellRanger website, but there is an error:

pybedtools.helpers.BEDToolsError: 
Command was:

    bedtools merge -s -i /tmp/pybedtools._rmom2my.tmp

Error message was:
Error: Sorted input specified, but the file /tmp/pybedtools._rmom2my.tmp has the following out of order record
chr1    6457264 6457360 chr1;ENSG00000187017;ENST00000645284;ESPN

My bedtools bersion is v2.26.0. This is confusing because when I used the annotation file downloaded from Ensembl, this error didn't occur. The difference between these two annotation files is that the annotation file from Ensembl doesn't have "chr" in chrom ID. But I don't know if it is the key reason causing this error. I found that you also used the annotation file from CellRanger in the example of Run SCAPE on single end scRNAseq datasets (10X genomics), so could you please provide the annotation file in this example. Then I can run this example and check if it works out. Thank you.

zhou-ran commented 2 years ago

Hi,

Could you send me the link of gtf, and I will try it.

Best, Ran

maximumko commented 2 years ago

Hello,

I'm trying to prepare my annotation downloaded from CellRanger website, but there is an error:

pybedtools.helpers.BEDToolsError: 
Command was:

  bedtools merge -s -i /tmp/pybedtools._rmom2my.tmp

Error message was:
Error: Sorted input specified, but the file /tmp/pybedtools._rmom2my.tmp has the following out of order record
chr1  6457264 6457360 chr1;ENSG00000187017;ENST00000645284;ESPN

My bedtools bersion is v2.26.0. This is confusing because when I used the annotation file downloaded from Ensembl, this error didn't occur. The difference between these two annotation files is that the annotation file from Ensembl doesn't have "chr" in chrom ID. But I don't know if it is the key reason causing this error. I found that you also used the annotation file from CellRanger in the example of Run SCAPE on single end scRNAseq datasets (10X genomics), so could you please provide the annotation file in this example. Then I can run this example and check if it works out. Thank you.

Hi, I met the same issue. Have you solved this problem? I downloaded the gtf file in ensembl-93 version and sorted it with bedtools=2.26.0.

Best, Max

zhou-ran commented 2 years ago

Hi all, This is the bedtools bug. I downloaded newest gtf, here is the code

gtf_url="http://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.primary_assembly.annotation.gtf.gz"
gtf_in="gencode.v32.primary_assembly.annotation.gtf"
curl -sS "$gtf_url" | zcat > "$gtf_in"

gtf_modified="$(basename "$gtf_in").modified"

ID="(ENS(MUS)?[GTE][0-9]+)\.([0-9]+)"
cat "$gtf_in" \
    | sed -E 's/gene_id "'"$ID"'";/gene_id "\1"; gene_version "\3";/' \
    | sed -E 's/transcript_id "'"$ID"'";/transcript_id "\1"; transcript_version "\3";/' \
    | sed -E 's/exon_id "'"$ID"'";/exon_id "\1"; exon_version "\3";/' \
    > "$gtf_modified"

BIOTYPE_PATTERN=\
"(protein_coding|lncRNA|\
IG_C_gene|IG_D_gene|IG_J_gene|IG_LV_gene|IG_V_gene|\
IG_V_pseudogene|IG_J_pseudogene|IG_C_pseudogene|\
TR_C_gene|TR_D_gene|TR_J_gene|TR_V_gene|\
TR_V_pseudogene|TR_J_pseudogene)"
GENE_PATTERN="gene_type \"${BIOTYPE_PATTERN}\""
TX_PATTERN="transcript_type \"${BIOTYPE_PATTERN}\""
READTHROUGH_PATTERN="tag \"readthrough_transcript\""
PAR_PATTERN="tag \"PAR\""

# Construct the gene ID allowlist. We filter the list of all transcripts
# based on these criteria:
#   - allowable gene_type (biotype)
#   - allowable transcript_type (biotype)
#   - no "PAR" tag (only present for Y chromosome PAR)
#   - no "readthrough_transcript" tag
# We then collect the list of gene IDs that have at least one associated
# transcript passing the filters.
cat "$gtf_modified" \
    | awk '$3 == "transcript"' \
    | grep -E "$GENE_PATTERN" \
    | grep -E "$TX_PATTERN" \
    | grep -Ev "$READTHROUGH_PATTERN" \
    | grep -Ev "$PAR_PATTERN" \
    | sed -E 's/.*(gene_id "[^"]+").*/\1/' \
    | sort \
    | uniq \
    > "gene_allowlist"

gtf_filtered="$(basename "$gtf_in").filtered"

grep -E "^#" "$gtf_modified" > "$gtf_filtered"
# Filter to the gene allowlist
grep -Ff "gene_allowlist" "$gtf_modified" \
    >> "$gtf_filtered"

python scape/main.py prepare --gtf gencode.v32.primary_assembly.annotation.gtf.filtered.sorted.gz --prefix gencode.v32

I also get a same error,

Error: Sorted input specified, but the file /tmp/pybedtools._rmom2my.tmp has the following out of order record
chr1    6457264 6457360 chr1;ENSG00000187017;ENST00000645284;ESPN

I found the bedtools don't properly sort the bed file,

chr1    6457264 6457360 chr1;ENSG00000187017;ENST00000477679;ESPN       .       +
chr1    6457264 6457360 chr1;ENSG00000187017;ENST00000461727;ESPN       .       +
chr1    6457264 6457360 chr1;ENSG00000187017;ENST00000434576;ESPN       .       +
chr1    6457264 6457360 chr1;ENSG00000187017;ENST00000633239;ESPN       .       +
chr1    6457264 6457360 chr1;ENSG00000187017;ENST00000475228;ESPN       .       +
chr1    **6457265** **6457265** chr1;ENSG00000187017;ENST00000416731;ESPN       .       +
chr1    6457264 6457360 chr1;ENSG00000187017;ENST00000645284;ESPN       .       +

I updated a new version of prepare module, and I'm going to close this issue for now. Please feel free to reopen it if you have any questions.

Best, Ran