Closed olechnwin closed 12 months ago
Hi Cen,
I suspect you have lines in the "gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed" file that have duplicated transcript ID's. Can you do a check on both BED files?
You can use something like:
Cat bed_file | awk '{print $4}' | sort | uniq -c | sort -nr
Thank you, Richard
Thanks for giving me the commands. That's very useful.
I think I don't have it the gencode bed file:
cat gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed | awk '{print $4}' | uniq -c | sort -nr | head
1 ENSG00000288725.1;ENST00000684388.1;NA;transcript
1 ENSG00000288724.1;ENST00000683399.1;NA;transcript
1 ENSG00000288723.1;ENST00000684005.1;NA;transcript
1 ENSG00000288722.1;ENST00000610495.2;NA;transcript
1 ENSG00000288721.1;ENST00000684631.1;NA;transcript
1 ENSG00000288721.1;ENST00000683313.1;NA;transcript
1 ENSG00000288721.1;ENST00000682596.1;NA;transcript
1 ENSG00000288720.1;ENST00000684202.1;NA;transcript
1 ENSG00000288720.1;ENST00000682011.1;NA;transcript
1 ENSG00000288718.1;ENST00000684015.1;NA;transcript
It's probably here?
cat split_10k/merged_annos_a673_cds_ensembl_filt2.bed | awk '{print $4}' | uniq -c | sort -nr | head
1 G999;G999.1;G999;G999.1;full_length;no_hit;NMD3;F2
1 G9999;G9999.1;G9999;G9999.1;5prime_degrade;no_hit;NMD2;F3
1 G9998;G9998.1;G9998;G9998.1;full_length;no_hit;NMD2;F2
1 G9996;G9996.1;G9996;G9996.1;full_length;no_hit;prot_ok;F3
1 G9995;G9995.1;G9995;G9995.1;full_length;no_hit;prot_ok;F3
1 G9994;G9994.9;G9994;G9994.9;full_length;no_hit;prot_ok;F3
1 G9994;G9994.8;G9994;G9994.8;full_length;no_hit;NMD2;F3
1 G9994;G9994.7;G9994;G9994.7;full_length;no_hit;NMD3;F2
1 G9994;G9994.6;G9994;G9994.6;full_length;no_hit;prot_ok;F2
1 G9994;G9994.5;G9994;G9994.5;full_length;no_hit;NMD1;F2
Sorry this is the first time I'm doing genome annotation. I got lost a lot. Really appreciate all your help.
Thank you, Cen
Hi Cen,
Actually could you run this command:
cat bed_file | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr
Because of the additional fields the other command I gave you may not be picking up something.
Thank you, Richard
Hi Richard,
Here are the outputs:
cat split_10k/gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr | head
1 ENST00000684777.1
1 ENST00000684775.1
1 ENST00000684774.1
1 ENST00000684773.1
1 ENST00000684772.1
1 ENST00000684771.1
1 ENST00000684769.1
1 ENST00000684768.1
1 ENST00000684767.1
1 ENST00000684766.1
cat split_10k/merged_annos_a673_cds_ensembl_filt2.bed | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr | head
771 ENST00000597492.1
466 ENST00000651981.1
360 ENST00000672276.1
344 ENST00000427105.1
296 ENST00000473086.3
277 ENST00000395391.2
242 ENST00000558775.5
238 ENST00000515303.2
237 ENST00000675526.1
216 ENST00000706171.1
Thank you!
Hi Cen,
Ok yeah that's your problem right there. I am not sure why but the file "merged_annos_a673_cds_ensembl_filt2.bed" has a lot of duplicated ensembl transcript ID's.
How did you make this file?
Thank you, Richard
Hi Richard,
Sorry for the late reply. I probably did too many TAMA merge? In the beginning, I had 3 Iso-Seq samples that I ran through the nfcore-isoseq pipeline. Then, I took the annotation results and merged it with annotation file that I got from running liftoff. The filelist is this:
ASP14_T1.bed no_cap 1,2,1 isoseq1
ASP14_T2.bed no_cap 1,2,1 isoseq2
ASP14_T3.bed no_cap 1,2,1 isoseq3
gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed no_cap 2,1,2 gencode
I took the merged annotation and ran the ORF prediction pipeline:
#converting bed to fasta
bed=merged_annos_a673_sort.bed
fasta=scaffolds_FINAL.fasta
outfile=merged_annos_a673_sort.fasta
bedtools getfasta -name -split -s -fi ${fasta} -bed ${bed} -fo ${outfile}
#getting ORF from transcript sequences
outfile=merged_annos_a673_sort_orf.fasta
fasta=merged_annos_a673_sort.fasta
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_orf_seeker.py -f ${fasta} -o ${outfile}
# split files
fasta=merged_annos_a673_sort_orf.fasta
output_prefix=merged_annos_a673_sort_split
cd $(dirname $fasta)
python ~/opt/tama/tama_go/split_files/tama_fasta_splitter.py $fasta $output_prefix 10000
# make database
ref=~/hg38/Homo_sapiens.GRCh38.pep.all.fa
#uniref=~/uniref/uniref100.fasta
cd ~/hg38/protein
# make database first
makeblastdb -in $ref -dbtype prot -out ensembl
# run blastp in parallel
orf=merged_annos_a673_sort_split
cd $(dirname $orf)
blastp -evalue 1e-10 -num_threads $SLURM_CPUS_PER_TASK -db ~/hg38/protein/ensembl -query ${orf}_${split_num}.fa -out blastp_ensembl_rslt_${split_num}.out
# concatenate and parse for top hits
orf=merged_annos_a673_sort_split
cd $(dirname $orf)
# concat the blastp output files
blastp=blastp_ensembl_rslt.txt
cat blastp_ensembl_rslt_*.out > $blastp
outfile=${blastp/.txt/}_parsed.txt
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_orf_blastp_parser.py -b ${blastp} -o ${outfile} -f ensembl
# create new bed file with CDS regions
orf=blastp_ensembl_rslt_parsed.txt
cd $(dirname $orf)
bed=merged_annos_a673_sort.bed
fasta=merged_annos_a673_sort.fasta
outfile=merged_annos_a673_cds_ensembl.bed
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outf
ile}
# change the format so that the first id displayed is the ensembl id
python ~/opt/tama/tama_go/format_converter/tama_format_id_filter.py -b ${outfile} \
-o ${outfile/.bed/_filt.bed} \
-s ensembl_orf
python ~/opt/tama/tama_go/format_converter/tama_convert_bed_gtf_ensembl_no_cds.py \
${outfile/.bed/_filt.bed} ${outfile/.bed/_filt.gtf}
orf=blastp_ensembl_rslt_parsed.txt
cd $(dirname $orf)
outfile=merged_annos_a673_cds_ensembl.bed
# filter out bad_match
grep -v bad_match ${outfile/.bed/_filt.bed} > ${outfile/.bed/_filt2.bed}
# merge with gencode liftoff
outfile2=merged_annos_a673_3
python ~/opt/tama/tama_merge.py -f filelist2.txt -d merge_dup -p $outfile2
#filelist2
gencode.v38.chr_patch_hapl_scaff.annotation.ct5.lifted.polished.sort.name.bed no_cap 2,1,2 gencode2
merged_annos_a673_cds_ensembl_filt2.bed no_cap 1,1,1 isoseq_pred
# change the format so that the first id displayed is the ensembl id
python ~/opt/tama/tama_go/format_converter/tama_format_id_filter.py -b merged_annos_a673_3.bed \
-o ${outfile2/.bed/_filt.bed} \
-s ensembl_orf
Thank you, Cen
Hi Cen,
Can you go through your output files starting from most recent and run the command I gave to you to see where the duplicated ID's formed?
You should look for this in the bash output: 771 ENST00000597492.1
Basically the first column shows how many instances of that transcript ID there are in the bed file.
So if you see any ID's that occur more than once then that is an issue.
Thank you, Richard
Hi Richard,
I checked the output files starting with the most recent. All of them has duplicated IDs until tama_cds_regions_bed_add
.
# create new bed file with CDS regions
orf=blastp_ensembl_rslt_parsed.txt
cd $(dirname $orf)
bed=merged_annos_a673_sort.bed
fasta=merged_annos_a673_sort.fasta
outfile=merged_annos_a673_cds_ensembl.bed
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outf
ile}
# change the format so that the first id displayed is the ensembl id
python ~/opt/tama/tama_go/format_converter/tama_format_id_filter.py -b ${outfile} \
-o ${outfile/.bed/_filt.bed} \
-s ensembl_orf
The output file from tama_format_id_filter
has duplicates:
cat split_10k/merged_annos_a673_cds_ensembl_filt.bed | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr | head
771 ENST00000597492.1
The output file from previous step, tama_cds_regions_bed_add
does not have duplicate:
cat split_10k/merged_annos_a673_cds_ensembl.bed | awk '{print $4}' | awk -F ";" '{print $2}' | sort | uniq -c | sort -nr | head
1 G9999.1
Thank you, Cen
Hi Cen,
Ok it looks like you used TAMA Merge or the ORF/NMD pipeline in the wrong way at some point. Where are the Ensembl transcript ID's coming from? This is the issue. Basically at some point you are setting it up so that multiple transcript models from your annotation are getting the same Ensembl transcript ID's. But I am not sure how you did this.
Thank you, Richard
Hi Richard,
Thanks for helping me with this issue. I thought the Ensembl transcript ID was from this database Homo_sapiens.GRCh38.pep.all.fa that I used instead of the uniref100.fasta used in your example pipeline.
Do you think it's the first TAMA Merge that I did with the outputs from nfcore pipeline?
This is how I did the very first TAMA merge combining outputs from the pipeline and liftoff annotation:
python /home/gdlessnicklab/cxt050/opt/tama/tama_merge.py -f filelist.txt -d merge_dup -p merged_annos_a673_2 -s gencode
Thank you,
Cen
Hi Cen,
If this is when the duplicate transcript ID's appeared then yes this would be the issue. You may want to change the order in which you merge things to avoid the issue of making duplicate ID's by carrying over Ensembl ID's. You may also want to try increasing the wobble thresholds to avoid isoform expansion.
Thank you, Richard
Hi Richard,
This is one example of duplicate IDs, right? In this bed file, G50504 has multiple transcript ID (i.e. G50504.1, G50504.2, etc) but they all are assigned to ENST00000447973.5
I have an additional question. The G50504 is actually a fusion gene (EWSR1 fused with FLI1). As shown above, it was annotated to ENSG00000182944 which is EWSR1. Do you have any advice on how to annotate fusion gene?
Thanks for your advice above. I will try to change the order of merge and try the wobble threshold.
Thank you, Cen
Hi Richard,
Additional observation that hopefully help in troubleshooting the duplicate trans id?
This is the input file for running tama_cds_regions_bed_add.py
:
code to run tama_cds_regions_bed_add.py
:
# create new bed file with CDS regions
orf=blastp_ensembl_rslt_parsed.txt
cd $(dirname $orf)
bed=merged_annos_a673_sort.bed
fasta=merged_annos_a673_sort.fasta
outfile=merged_annos_a673_cds_ensembl.bed
python ~/opt/tama/tama_go/orf_nmd_predictions/tama_cds_regions_bed_add.py -p ${orf} -a ${bed} -f ${fasta} -o ${outf
ile}
This is the output file, where you now see different transcripts G8112.5, G8112.4 all have the same transcript id ENST00000428768:
Thank you, Cen
Hi Richard,
I think the problem was because multiple transcript models matched to the same hits with blastp. For example the transcripts for G50493.2, .3 and .4 all three match to ENST00000406323.3
cat split_10k/merged_annos_a673_cds_ensembl.bed | grep G50493
scaffold_8 7517499 7691397 G50493;G50493.1;ENSG00000183579.16,ENST00000544604.7;full_length;50_match;NMD1;F1 40 + 7517694 7684480 200,0,255 9 495,126,75,132,111,168,103,1753,3917 0,103474,158897,159702,161183,163118,164791,165599,169981
scaffold_8 7551300 7687587 G50493;G50493.2;ENSG00000183579.16,ENST00000406323.3;full_length;50_match;NMD1;F1 40 + 7620973 7684480 200,255,0 9 93,126,75,132,111,168,103,1753,107 0,69673,125096,125901,127382,129317,130990,131798,136180
scaffold_8 7620949 7687867 G50493;G50493.3;ENSG00000183579.16,ENST00000406323.3;full_length;50_match;NMD1;F1 40 + 7620973 7684480 200,255,0 8 150,75,132,111,168,103,1753,387 0,55447,56252,57733,59668,61341,62149,66531
scaffold_8 7680698 7687721 G50493;G50493.4;ENSG00000183579.16,ENST00000406323.3;5prime_degrade;50_match;NMD1;F1 40 + 7680698 7684480 255,100,0 4 87,103,1753,241 0,1592,2400,6782
IGV screenshot:
Do you have any suggestions on how to resolve this?
Thank you, Cen
Hi Cen,
Sorry for the late response. I haven't had much time to respond to issues this year unfortunately.
This is why I contain both the TAMA ID's and the Ensembl ID's because it can happen that different transcripts hit to the same Ensembl ID. This is just the property of the data. Hope this makes sense but let me know if you need more explanation.
Thank you, Richard
Hi Richard,
I had run TAMA ORF prediction and after that I was trying to merge this annotation to a liftoff annotation I had. I got the duplicate trans id and I had already made sure the filelist does have unique names.
This is the last command I ran for the ORF prediction pipeline:
These are the commands to merge the annotation from ORF prediction and a liftoff annotation:
My filelist2.txt
Here is the exact error:
How do I fix this error ? Thank you again for your help. Cen