mickey-spongebob closed this issue 3 years ago
Hi mickey-spongebob, thank you for your interest in pygtftk. Could you:
1/ Try using an example file:
gtftk get_example | gtftk short_long
2/ Then you could try convert_ensembl (Description: Convert the GTF file to ensembl format. It will essentially add a 'transcript' feature and 'gene' feature when required. This command can be viewed as a 'groomer' command for those starting with a non ensembl GTF).
gtftk convert_ensembl -i ... | gtftk short_long
3/ Alternatively you could perhaps provide us with a subset of the file so that we may have a look.
4/ Use "gtftk -s" to tell us more about your installation.
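Before step 2, a quick way to see whether a GTF lacks 'gene' or 'transcript' rows is to count the feature types in column 3. This is only a minimal sketch using standard awk and sort, not a pygtftk command; the toy_gtf data and the feature_counts function name are hypothetical and only there for illustration.

```shell
# Hypothetical toy GTF: exon rows only, no "gene" or "transcript" features.
toy_gtf() {
    printf 'chr1\ttoy\texon\t100\t199\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";\n'
    printf 'chr1\ttoy\texon\t9901\t10000\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";\n'
    printf 'chr1\ttoy\texon\t500\t1500\t.\t+\t.\tgene_id "geneA"; transcript_id "tx2";\n'
}

# Count rows per feature type (column 3), ignoring comment lines.
feature_counts() {
    awk -F'\t' '
    !/^#/ { n[$3]++ }
    END   { for (f in n) print f "\t" n[f] }' | sort
}

toy_gtf | feature_counts
# Prints a single line: exon<TAB>3
```

If only exon (and perhaps CDS) rows show up, the file is probably one of the non-Ensembl GTFs that convert_ensembl is meant to groom. On a real file, replace toy_gtf with cat your_file.gtf.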
Best
Thanks for the quick reply!! :-)
I tried all your suggestions, and it does run fine on the cluster with minimal RAM once I converted the GTF files to Ensembl format :-)
Another problem arose though. I checked the output of my files and it seems that the command 'gtftk short_long -i
Hope that helps and thank you so much!
Best, kevin
Ahhhhh,
I apologise. Upon closer inspection, it seems that 'gtftk short_long -l' looks for the maximum end position of the transcript to determine size? Unless I'm mistaken...
So for example, I took a snapshot of the first few lines of a gtf file
From this snapshot, the gene "XLOC_00001" spans from 3111 - 19680, and has three transcripts in this locus. After I run 'gtftk short_long -i
Do you think this is fixable or perhaps just pre-merging the transcript models on the same gene is best for now :-)
Thank you again and sorry for being a headache.
Best, kevin
Hi Kevin, what it computes is the size of the mature transcript (sum of exons).
Description: Select the shortest mature transcript (i.e without introns) for each gene or the longest if the -l arguments is used.
Indeed, this is what it does:
Let's select the 3 genes with highest number of transcripts
# 1. get an example dataset
# 2. add the number of transcripts to each gene feature (key=nb_tx)
# 3. select only gene features (-g)
# 4. tabulate the GTF (column gene_id and nb_tx)
# 5. Sort by number of transcripts
# 6. Extract the genes with the highest number of transcripts
gtftk get_example -d "mini_real" | \
gtftk nb_transcripts | \
gtftk select_by_key -g | \
gtftk tabulate -x -k gene_id,nb_tx | \
sort -nk 2,2 | \
tail -n 3
ENSG00000049759 38
ENSG00000072778 39
ENSG00000096093 39
Now check the size of the corresponding longest transcripts
# 1. get an example dataset
# 2. Select the 3 example genes
# 3. Compute the size of the mature transcript
# 4. Compute the size of exon (just to check the sum)
# 5. Select only 'transcript' features
# 6. Convert to tabulated format
# 7. Sort by size
gtftk get_example -d "mini_real" | \
gtftk select_by_key -k gene_id -v ENSG00000049759,ENSG00000072778,ENSG00000096093 | \
gtftk feature_size -t mature_rna | \
gtftk exon_sizes | \
gtftk select_by_key -t | \
gtftk tabulate -x -k gene_id,transcript_id,feat_size,exon_sizes | \
sort -nk3,3
ENSG00000049759 ENST00000456986 8436
ENSG00000096093 ENST00000371068 6987
ENSG00000072778 ENST00000322910 2352
This is what I get regarding the longest (-l) with short_long
gtftk get_example -d "mini_real" | \
gtftk select_by_key -k gene_id -v ENSG00000049759,ENSG00000072778,ENSG00000096093 | \
gtftk short_long -l | \
gtftk feature_size -t mature_rna | \
gtftk select_by_key -t | \
gtftk tabulate -x -k gene_id,transcript_id,feat_size
ENSG00000072778 ENST00000322910 2352
ENSG00000049759 ENST00000456986 8436
ENSG00000096093 ENST00000371068 6987
Regarding the shortest, we expect the 3 transcripts below:
# 1. get an example dataset
# 2. Select the 3 example genes
# 3. Compute the size of the mature transcript
# 4. Compute the size of exon (just to check the sum)
# 5. Select only 'transcript' features
# 6. Convert to tabulated format
# 7. Sort by size
gtftk get_example -d "mini_real" | \
gtftk select_by_key -k gene_id -v ENSG00000049759,ENSG00000072778,ENSG00000096093 | \
gtftk feature_size -t mature_rna | \
gtftk exon_sizes | \
gtftk select_by_key -t | \
gtftk tabulate -x -k gene_id,transcript_id,feat_size,exon_sizes | \
sort -nrk3,3
ENSG00000096093 ENST00000481466 387 25,211,151
ENSG00000072778 ENST00000582450 294 113,181
ENSG00000049759 ENST00000591579 144 92,52
This is what I get regarding the shortest with short_long
gtftk get_example -d "mini_real" | \
gtftk select_by_key -k gene_id -v ENSG00000049759,ENSG00000072778,ENSG00000096093 | \
gtftk short_long | \
gtftk feature_size -t mature_rna | \
gtftk select_by_key -t | \
gtftk tabulate -x -k gene_id,transcript_id,feat_size
I think what you want is different. You are interested in the genomic size, if I understand correctly.
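To make the two notions of size concrete, here is a minimal sketch that computes both for each transcript: the mature size (sum of exon lengths) and the genomic span (last exon end minus first exon start, plus one). It uses plain awk on a hypothetical toy GTF rather than pygtftk itself; toy_gtf and tx_sizes are made-up names, and the attribute parsing assumes a transcript_id "..." entry on every exon row.

```shell
# Hypothetical toy GTF: tx1 has two small exons far apart, tx2 has one large exon.
toy_gtf() {
    printf 'chr1\ttoy\texon\t100\t199\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";\n'
    printf 'chr1\ttoy\texon\t9901\t10000\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";\n'
    printf 'chr1\ttoy\texon\t500\t1500\t.\t+\t.\tgene_id "geneA"; transcript_id "tx2";\n'
}

# Print per transcript: transcript_id, mature size, genomic span (tab-separated).
tx_sizes() {
    awk -F'\t' '
    $3 == "exon" && match($9, /transcript_id "[^"]+"/) {
        # Strip the 15-character prefix and the closing quote to get the ID.
        tx = substr($9, RSTART + 15, RLENGTH - 16)
        mature[tx] += $5 - $4 + 1
        if (!(tx in first) || $4 + 0 < first[tx]) first[tx] = $4 + 0
        if (!(tx in last)  || $5 + 0 > last[tx])  last[tx]  = $5 + 0
    }
    END {
        for (tx in mature) {
            span = last[tx] - first[tx] + 1
            print tx "\t" mature[tx] "\t" span
        }
    }'
}

toy_gtf | tx_sizes | sort
# Prints:
# tx1<TAB>200<TAB>9901
# tx2<TAB>1001<TAB>1001
```

In this toy case tx1 covers the larger genomic region while tx2 is the longer mature transcript, which is the kind of discrepancy discussed above.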
Hmm I see...
I am interested in the transcript size but perhaps my emphasis was looking at the genomic region. Nevertheless, I've looked at the other transcripts and it seems that it is consistently selecting the longest transcripts. So maybe this is an example instance where the genomic coordinates are misleading for transcript size (i.e. splicing etc.)
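As a rough illustration of how the two criteria can pick different transcripts for the same gene, the sketch below selects, per gene, the transcript with the largest mature size and the one with the largest genomic span. It is plain awk on hypothetical toy data, not how short_long is implemented; toy_gtf and longest_per_gene are made-up names, ties are not handled, and the simple gene_id pattern would need tightening for GTFs that also carry attributes such as ref_gene_id.

```shell
# Hypothetical toy GTF: one gene with two transcripts.
toy_gtf() {
    printf 'chr1\ttoy\texon\t100\t199\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";\n'
    printf 'chr1\ttoy\texon\t9901\t10000\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";\n'
    printf 'chr1\ttoy\texon\t500\t1500\t.\t+\t.\tgene_id "geneA"; transcript_id "tx2";\n'
}

# Print per gene: gene_id, longest transcript by mature size, longest by genomic span.
longest_per_gene() {
    awk -F'\t' '
    $3 == "exon" {
        if (!match($9, /gene_id "[^"]+"/)) next
        gene = substr($9, RSTART + 9, RLENGTH - 10)
        if (!match($9, /transcript_id "[^"]+"/)) next
        tx = substr($9, RSTART + 15, RLENGTH - 16)
        owner[tx] = gene
        mature[tx] += $5 - $4 + 1
        if (!(tx in first) || $4 + 0 < first[tx]) first[tx] = $4 + 0
        if (!(tx in last)  || $5 + 0 > last[tx])  last[tx]  = $5 + 0
    }
    END {
        for (tx in mature) {
            g = owner[tx]
            span = last[tx] - first[tx] + 1
            if (mature[tx] > best_mature[g]) { best_mature[g] = mature[tx]; by_mature[g] = tx }
            if (span > best_span[g])         { best_span[g] = span;         by_span[g] = tx }
        }
        for (g in by_mature) print g "\t" by_mature[g] "\t" by_span[g]
    }'
}

toy_gtf | longest_per_gene | sort
# Prints: geneA<TAB>tx2<TAB>tx1
```

Here the mature-size criterion picks tx2 and the span criterion picks tx1, so a gene whose locus looks dominated by one transcript in genomic coordinates can still have a different transcript as its longest mature RNA.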
I'm sorry for bothering you and thank you for all your help!! :-)
I'll close the issue if you're ok with it :-)
Hello,
Hope this finds you well. Thank you for the tool! I'd like to use the 'gtftk short_long' command (with the option -l) in order to extract the longest transcript for a given gene from my gtf file. I seem to have successfully installed the 'pygtftk' software via Conda on our SLURM cluster (also including the gtftk -b output in my .bashrc profile).
Unfortunately, when I run 'gtftk short_long -i animal_species.gtf -o animal_species_long.gtf -l -V 3', it gives me the following error:
/var/spool/slurm/job20922901/slurm_script: line 10: 11278 Segmentation fault
My original .gtf file is ~109M and I provided a total RAM of 800G and it still gives me this error. I then cut my file size to 5.3K (subset to 3 genes with their transcripts) giving the job 48G RAM, and it still gives me the same error above:
/var/spool/slurm/job20922906/slurm_script: line 10: 8212 Segmentation fault
Is this the expected behaviour? Am I missing something?
Any advice on the matter would be kindly appreciated :-)
Best wishes, kevin