GenomeRIK / tama

Transcriptome Annotation by Modular Algorithms (for long read RNA sequencing data)
GNU General Public License v3.0
128 stars 25 forks source link

Error with duplicate trans id #102

Closed olechnwin closed 12 months ago

olechnwin commented 1 year ago

Hi Richard,

I had run TAMA ORF prediction and after that I was trying to merge this annotation to a liftoff annotation I had. I got the duplicate trans id and I had already made sure the filelist does have unique names.

This is the last command I ran for the ORF prediction pipeline:

orf=blastp_ensembl_rslt_parsed.txt
bed=merged_annos_a673_sort.bed
fasta=merged_annos_a673_sort.fasta
outfile=merged_annos_a673_cds_ensembl.bed
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outfil
e}

These are the commands to merge the annotation from ORF prediction and a liftoff annotation:

outfile=merged_annos_a673_cds_ensembl.bed
# filter out bad_match
grep -v bad_match ${outfile/.bed/_filt.bed} > ${outfile/.bed/_filt2.bed}

outfile2=merged_annos_a673_3
python ~/opt/tama/tama_merge.py -f filelist2.txt -d merge_dup -p $outfile2

My filelist2.txt

gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed   no_cap  2,1,2   gencode2
merged_annos_a673_cds_ensembl_filt2.bed no_cap  1,1,1   isoseq_pred

Here is the exact error:

+ grep -v bad_match merged_annos_a673_cds_ensembl_filt.bed
+ outfile2=merged_annos_a673_3
+ python ~/opt/tama/tama_merge.py -f filelist2.txt -d merge_dup -p merged_annos_a673_3
Default collapse exon ends flag will be used: common_ends
Default 5 prime threshold: 20
Default exon end/splice junction threshold: 10
Default 3 prime threshold: 20
Default source ID merge flag: no_source_id
Default CDS merge flag: no_cds
opening file list
opening bed list
opening bed list
running hunter_prey_nocap
3
hunter id unsearched_count: gencode2_ENST00000503774.1
['gencode2_ENST00000503774.1', 'isoseq_pred_G1.1', 'gencode2_ENST00000616656.1']
hunter id unsearched_count: isoseq_pred_G1.1
['isoseq_pred_G1.1', 'gencode2_ENST00000616656.1']
invoke merge_a_b_groups_nocap 4 isoseq_pred_G1.1 gencode2_ENST00000616656.1
hunter id new_hunter: gencode2_ENST00000616656.1
['gencode2_ENST00000616656.1']
['gencode2_ENST00000616656.1']
1
hunter id unsearched_count: gencode2_ENST00000616656.1
['gencode2_ENST00000616656.1']
gencode2_ENST00000503774.1
isoseq_pred_G1.1
gencode2_ENST00000616656.1
Error with duplicate trans id

How do I fix this error ? Thank you again for your help. Cen

GenomeRIK commented 1 year ago

Hi Cen,

I suspect you have lines in the "gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed" file that have duplicated transcript ID's. Can you do a check on both BED files?

You can use something like:

Cat bed_file | awk '{print $4}' | sort | uniq -c | sort -nr

Thank you, Richard

olechnwin commented 1 year ago

Thanks for giving me the commands. That's very useful.

I think I don't have it the gencode bed file:

cat gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed | awk '{print $4}' | uniq -c | sort -nr | head

     1 ENSG00000288725.1;ENST00000684388.1;NA;transcript
      1 ENSG00000288724.1;ENST00000683399.1;NA;transcript
      1 ENSG00000288723.1;ENST00000684005.1;NA;transcript
      1 ENSG00000288722.1;ENST00000610495.2;NA;transcript
      1 ENSG00000288721.1;ENST00000684631.1;NA;transcript
      1 ENSG00000288721.1;ENST00000683313.1;NA;transcript
      1 ENSG00000288721.1;ENST00000682596.1;NA;transcript
      1 ENSG00000288720.1;ENST00000684202.1;NA;transcript
      1 ENSG00000288720.1;ENST00000682011.1;NA;transcript
      1 ENSG00000288718.1;ENST00000684015.1;NA;transcript

It's probably here? cat split_10k/merged_annos_a673_cds_ensembl_filt2.bed | awk '{print $4}' | uniq -c | sort -nr | head

  1 G999;G999.1;G999;G999.1;full_length;no_hit;NMD3;F2
  1 G9999;G9999.1;G9999;G9999.1;5prime_degrade;no_hit;NMD2;F3
  1 G9998;G9998.1;G9998;G9998.1;full_length;no_hit;NMD2;F2
  1 G9996;G9996.1;G9996;G9996.1;full_length;no_hit;prot_ok;F3
  1 G9995;G9995.1;G9995;G9995.1;full_length;no_hit;prot_ok;F3
  1 G9994;G9994.9;G9994;G9994.9;full_length;no_hit;prot_ok;F3
  1 G9994;G9994.8;G9994;G9994.8;full_length;no_hit;NMD2;F3
  1 G9994;G9994.7;G9994;G9994.7;full_length;no_hit;NMD3;F2
  1 G9994;G9994.6;G9994;G9994.6;full_length;no_hit;prot_ok;F2
  1 G9994;G9994.5;G9994;G9994.5;full_length;no_hit;NMD1;F2

Sorry this is the first time I'm doing genome annotation. I got lost a lot. Really appreciate all your help.

Thank you, Cen

GenomeRIK commented 1 year ago

Hi Cen,

Actually could you run this command:

cat bed_file | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr

Because of the additional fields the other command I gave you may not be picking up something.

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard,

Here are the outputs:

cat split_10k/gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr | head
      1 ENST00000684777.1
      1 ENST00000684775.1
      1 ENST00000684774.1
      1 ENST00000684773.1
      1 ENST00000684772.1
      1 ENST00000684771.1
      1 ENST00000684769.1
      1 ENST00000684768.1
      1 ENST00000684767.1
      1 ENST00000684766.1
cat split_10k/merged_annos_a673_cds_ensembl_filt2.bed | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr | head
    771 ENST00000597492.1
    466 ENST00000651981.1
    360 ENST00000672276.1
    344 ENST00000427105.1
    296 ENST00000473086.3
    277 ENST00000395391.2
    242 ENST00000558775.5
    238 ENST00000515303.2
    237 ENST00000675526.1
    216 ENST00000706171.1

Thank you!

GenomeRIK commented 1 year ago

Hi Cen,

Ok yeah that's your problem right there. I am not sure why but the file "merged_annos_a673_cds_ensembl_filt2.bed" has a lot of duplicated ensembl transcript ID's.

How did you make this file?

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard,

Sorry for the late reply. I probably did too many TAMA merge? In the beginning, I had 3 Iso-Seq samples that I ran through the nfcore-isoseq pipeline. Then, I took the annotation results and merged it with annotation file that I got from running liftoff. The filelist is this:

ASP14_T1.bed    no_cap  1,2,1   isoseq1
ASP14_T2.bed    no_cap  1,2,1   isoseq2
ASP14_T3.bed    no_cap  1,2,1   isoseq3
gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed   no_cap  2,1,2   gencode

I took the merged annotation and ran the ORF prediction pipeline:

#converting bed to fasta
bed=merged_annos_a673_sort.bed
fasta=scaffolds_FINAL.fasta
outfile=merged_annos_a673_sort.fasta
bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}

#getting ORF from transcript sequences
outfile=merged_annos_a673_sort_orf.fasta
fasta=merged_annos_a673_sort.fasta
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_orf_seeker.py -f ${fasta} -o ${outfile}

# split files
fasta=merged_annos_a673_sort_orf.fasta
output_prefix=merged_annos_a673_sort_split
cd $(dirname $fasta)
python ~/opt/tama/tama_go/split_files/tama_fasta_splitter.py $fasta $output_prefix 10000

# make database
ref=~/hg38/Homo_sapiens.GRCh38.pep.all.fa
#uniref=~/uniref/uniref100.fasta
cd ~/hg38/protein
# make database first
makeblastdb -in $ref -dbtype prot -out ensembl

# run blastp in parallel
orf=merged_annos_a673_sort_split
cd $(dirname $orf)
blastp -evalue 1e-10 -num_threads $SLURM_CPUS_PER_TASK -db ~/hg38/protein/ensembl -query ${orf}_${split_num}.fa -out blastp_ensembl_rslt_${split_num}.out

# concatenate and parse for top hits
orf=merged_annos_a673_sort_split
cd $(dirname $orf)
# concat the blastp output files
blastp=blastp_ensembl_rslt.txt
cat blastp_ensembl_rslt_*.out > $blastp
outfile=${blastp/.txt/}_parsed.txt
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_orf_blastp_parser.py -b ${blastp} -o ${outfile} -f ensembl

# create new bed file with CDS regions
orf=blastp_ensembl_rslt_parsed.txt
cd $(dirname $orf)
bed=merged_annos_a673_sort.bed
fasta=merged_annos_a673_sort.fasta
outfile=merged_annos_a673_cds_ensembl.bed
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outf
ile}
# change the format so that the first id displayed is the ensembl id
python ~/opt/tama/tama_go/format_converter/tama_format_id_filter.py -b ${outfile} \
        -o ${outfile/.bed/_filt.bed} \
        -s ensembl_orf
python ~/opt/tama/tama_go/format_converter/tama_convert_bed_gtf_ensembl_no_cds.py \
        ${outfile/.bed/_filt.bed} ${outfile/.bed/_filt.gtf}

orf=blastp_ensembl_rslt_parsed.txt
cd $(dirname $orf)
outfile=merged_annos_a673_cds_ensembl.bed
# filter out bad_match
grep -v bad_match ${outfile/.bed/_filt.bed} > ${outfile/.bed/_filt2.bed}

# merge with gencode liftoff
outfile2=merged_annos_a673_3
python ~/opt/tama/tama_merge.py -f filelist2.txt -d merge_dup -p $outfile2

#filelist2
gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed   no_cap  2,1,2   gencode2
merged_annos_a673_cds_ensembl_filt2.bed no_cap  1,1,1   isoseq_pred

# change the format so that the first id displayed is the ensembl id
python ~/opt/tama/tama_go/format_converter/tama_format_id_filter.py -b merged_annos_a673_3.bed \
        -o ${outfile2/.bed/_filt.bed} \
        -s ensembl_orf

Thank you, Cen

GenomeRIK commented 1 year ago

Hi Cen,

Can you go through your output files starting from most recent and run the command I gave to you to see where the duplicated ID's formed?

You should look for this in the bash output: 771 ENST00000597492.1

Basically the first column shows how many instances of that transcript ID there are in the bed file.

So if you see any ID's that occur more than once then that is an issue.

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard,

I checked the output files starting with the most recent. All of them has duplicated IDs until tama_cds_regions_bed_add.

# create new bed file with CDS regions
orf=blastp_ensembl_rslt_parsed.txt
cd $(dirname $orf)
bed=merged_annos_a673_sort.bed
fasta=merged_annos_a673_sort.fasta
outfile=merged_annos_a673_cds_ensembl.bed
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outf
ile}
# change the format so that the first id displayed is the ensembl id
python ~/opt/tama/tama_go/format_converter/tama_format_id_filter.py -b ${outfile} \
        -o ${outfile/.bed/_filt.bed} \
        -s ensembl_orf

The output file from tama_format_id_filter has duplicates:

cat split_10k/merged_annos_a673_cds_ensembl_filt.bed | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr | head
    771 ENST00000597492.1

The output file from previous step, tama_cds_regions_bed_add does not have duplicate:

cat split_10k/merged_annos_a673_cds_ensembl.bed | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr | head
      1 G9999.1

Thank you, Cen

GenomeRIK commented 1 year ago

Hi Cen,

Ok it looks like you used TAMA Merge or the ORF/NMD pipeline in the wrong way at some point. Where are the Ensembl transcript ID's coming from? This is the issue. Basically at some point you are setting it up so that multiple transcript models from your annotation are getting the same Ensembl transcript ID's. But I am not sure how you did this.

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard,

Thanks for helping me with this issue. I thought the Ensembl transcript ID was from this database Homo_sapiens.GRCh38.pep.all.fa that I used instead of the uniref100.fasta used in your example pipeline.

Do you think it's the first TAMA Merge that I did with the outputs from nfcore pipeline? This is how I did the very first TAMA merge combining outputs from the pipeline and liftoff annotation: python /home/gdlessnicklab/cxt050/opt/tama/tama_merge.py -f filelist.txt -d merge_dup -p merged_annos_a673_2 -s gencode Thank you, Cen

GenomeRIK commented 1 year ago

Hi Cen,

If this is when the duplicate transcript ID's appeared then yes this would be the issue. You may want to change the order in which you merge things to avoid the issue of making duplicate ID's by carrying over Ensembl ID's. You may also want to try increasing the wobble thresholds to avoid isoform expansion.

Thank you, Richard

olechnwin commented 1 year ago

Hi Richard,

This is one example of duplicate IDs, right? In this bed file, G50504 has multiple transcript ID (i.e. G50504.1, G50504.2, etc) but they all are assigned to ENST00000447973.5

image

I have an additional question. The G50504 is actually a fusion gene (EWSR1 fused with FLI1). As shown above, it was annotated to ENSG00000182944 which is EWSR1. Do you have any advice on how to annotate fusion gene?

Thanks for your advice above. I will try to change the order of merge and try the wobble threshold.

Thank you, Cen

olechnwin commented 1 year ago

Hi Richard,

Additional observation that hopefully help in troubleshooting the duplicate trans id? This is the input file for running tama_cds_regions_bed_add.py: image

code to run tama_cds_regions_bed_add.py:

# create new bed file with CDS regions
orf=blastp_ensembl_rslt_parsed.txt
cd $(dirname $orf)
bed=merged_annos_a673_sort.bed
fasta=merged_annos_a673_sort.fasta
outfile=merged_annos_a673_cds_ensembl.bed
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outf
ile} 

This is the output file, where you now see different transcripts G8112.5, G8112.4 all have the same transcript id ENST00000428768: image

Thank you, Cen

olechnwin commented 1 year ago

Hi Richard,

I think the problem was because multiple transcript models matched to the same hits with blastp. For example the transcripts for G50493.2, .3 and .4 all three match to ENST00000406323.3

cat split_10k/merged_annos_a673_cds_ensembl.bed | grep G50493
scaffold_8  7517499 7691397 G50493;G50493.1;ENSG00000183579.16,ENST00000544604.7;full_length;50_match;NMD1;F1   40  +   7517694 7684480 200,0,255   9   495,126,75,132,111,168,103,1753,3917    0,103474,158897,159702,161183,163118,164791,165599,169981
scaffold_8  7551300 7687587 G50493;G50493.2;ENSG00000183579.16,ENST00000406323.3;full_length;50_match;NMD1;F1   40  +   7620973 7684480 200,255,0   9   93,126,75,132,111,168,103,1753,107  0,69673,125096,125901,127382,129317,130990,131798,136180
scaffold_8  7620949 7687867 G50493;G50493.3;ENSG00000183579.16,ENST00000406323.3;full_length;50_match;NMD1;F1   40  +   7620973 7684480 200,255,0   8   150,75,132,111,168,103,1753,387 0,55447,56252,57733,59668,61341,62149,66531
scaffold_8  7680698 7687721 G50493;G50493.4;ENSG00000183579.16,ENST00000406323.3;5prime_degrade;50_match;NMD1;F1    40  +   7680698 7684480 255,100,0   4   87,103,1753,241 0,1592,2400,6782

IGV screenshot:

image

Do you have any suggestions on how to resolve this?

Thank you, Cen

GenomeRIK commented 12 months ago

Hi Cen,

Sorry for the late response. I haven't had much time to respond to issues this year unfortunately.

This is why I contain both the TAMA ID's and the Ensembl ID's because it can happen that different transcripts hit to the same Ensembl ID. This is just the property of the data. Hope this makes sense but let me know if you need more explanation.

Thank you, Richard