dputhier / pygtftk

A python package and a set of shell commands to handle GTF files
GNU General Public License v3.0
45 stars 6 forks source link

segmentation fault on a SLURM cluster for 'gtftk short_long' command #163

Closed mickey-spongebob closed 3 years ago

mickey-spongebob commented 3 years ago

Hallo,

Hope this finds you well. Thank you for the tool! I'd like to use the 'gtftk short_long' command (with the option -l) in order to extract the longest transcript for a given gene from my gtf file. I seemed to have successfully installed 'pygtftk' software via Conda on our SLURM cluster (also including the gtftk -b output in my .bashrc profile).

Unfortunately when I try to run the command like so gtftk short_long -i animal_species.gtf -o animal_species_long.gtf -l -V 3, it gives me an error like so:

/var/spool/slurm/job20922901/slurm_script: line 10: 11278 Segmentation fault

My original .gtf file is ~109M and I provided a total RAM of 800G and it still gives me this error. I then cut my file size to 5.3K (subset to 3 genes with their transcripts) giving the job 48G RAM, and it still gives me the same error above:

/var/spool/slurm/job20922906/slurm_script: line 10: 8212 Segmentation fault

Is this the expected behaviour? Am I missing something?

Any advice on the matter would be kindly appreciated :-)

Best wishes, kevin

dputhier commented 3 years ago

Hi mickey-spongebob, Thank you for your interest in pygtftk. Could you

1 / try using an example file:

     gtftk get_example | gtftk short_long

2/ Then you could you try to use convert_ensembl (Description: Convert the GTF file to ensembl format. It will essentially add a 'transcript' feature and 'gene' feature when required. This command can be viewed as a 'groomer' command for those starting with a non ensembl GTF).

       gtftk convert_ensembl -i ... | gtftk short_long

3/ Alternatively you could perhaps provide us with a subset of the file so that we may have a look.

4/ Use "gtftk -s" to tell us more about your installation.

Best Best

mickey-spongebob commented 3 years ago

Thanks for the quick reply!! :-)

I tried all your solutions and it does seem that it runs fine on the cluster with minimal RAM :-) Once I converted the gtf files to Ensembl format.

Another problem arose though. I checked the output of my files and it seems that the command 'gtftk short_long -i -o -l -V 3' returns the shortest transcripts rather than the longest. is there a way around this?

Hope that helps and thank you so much!

Best, kevin

mickey-spongebob commented 3 years ago

Ahhhhh,

I apologise. Upon closer inspection, it seems that 'gtftk short_long -l' looks for the maximum end position of the transcript to determine size? Unless I'm mistaken...

So for example, I took a snapshot of the first few lines of a gtf file

Screenshot 2021-07-12 at 16 04 46

From this snapshot, the gee "XLOC_00001" spans from 3111 - 19680, and has three transcripts in this locus. After I run 'gtftk short_long -i -o -l -V 3', the longest transcript selected from this region is "TCONS_000000003". I assume this is so, because this transcript ends at position 19680 (i.e. the end of gene XLOC_00001), even though the actual sequence itself is small (i.e. ~5000bp) whereas "TCONS_000000001" spans ~12000bp.

Do you think this is fixable or perhaps just pre-merging the transcript models on the same gene is best for now :-)

Thank you again and sorry for being a headache.

Best, kevin

dputhier commented 3 years ago

Hi Kevin, What it computes is the size of the mature transcrit (sum of exons).

          Description: Select the shortest mature transcript (i.e without introns) for each gene or the longest if the -l arguments is used.

Indeed, this is what it does:

Let's select the 3 genes with highest number of transcripts

    # 1. get an example dataset
    # 2. add the number of transcript to each gene feature (key=nb_tx)
    # 3. select only gene features (-g)
    # 4. tabulate the GTF (column gene_id and nb_tx)
    # 5. Sort by number of transcrit
    # 6. Extract the gene with the highest number of transcrits
    gtftk get_example -d "mini_real" | \
    gtftk nb_transcripts  | \
    gtftk select_by_key -g | \
    gtftk tabulate -x -k gene_id,nb_tx | \
    sort -nk 2,2 | \
    tail -n 3
    ENSG00000049759 38
    ENSG00000072778 39
    ENSG00000096093 39

Now check the size of corresponding longuest transcripts

    # 1. get an example dataset
    # 2. Select the 3 example genes
    # 3. Compute the size of the mature transcript
    # 4. Compute the size of exon (just to check the sum)
    # 5. Select only 'transcript' features 
    # 6. Convert to tabulated format
    # 7. Sort by size
    gtftk get_example -d "mini_real" | 
    gtftk select_by_key -k gene_id -v ENSG00000049759,ENSG00000072778,ENSG00000096093 | gtftk feature_size -t mature_rna | 
    gtftk exon_sizes | 
    gtftk select_by_key -t| gtftk tabulate -x -k gene_id,transcript_id,feat_size,exon_sizes | 
    sort -nk3,3

    ENSG00000049759 ENST00000456986 8436
    ENSG00000096093 ENST00000371068 6987
    ENSG00000072778 ENST00000322910 2352

This is what I get regarding the longuest (-l) with short_long

     gtftk get_example -d "mini_real" | 
     gtftk select_by_key -k gene_id -v ENSG00000049759,ENSG00000072778,ENSG00000096093 | gtftk short_long -l | 
     gtftk feature_size -t mature_rna | 
     gtftk select_by_key -t| gtftk tabulate -x -k gene_id,transcript_id,feat_size

    ENSG00000072778 ENST00000322910 2352
    ENSG00000049759 ENST00000456986 8436
    ENSG00000096093 ENST00000371068 6987

Regarding the shortest, we expect the 3 transcripts below:

    # 1. get an example dataset
    # 2. Select the 3 example genes
    # 3. Compute the size of the mature transcript
    # 4. Compute the size of exon (just to check the sum)
    # 5. Select only 'transcript' features 
    # 6. Convert to tabulated format
    # 7. Sort by size
    gtftk get_example -d "mini_real" | 
    gtftk select_by_key -k gene_id -v ENSG00000049759,ENSG00000072778,ENSG00000096093 | gtftk feature_size -t mature_rna | 
    gtftk exon_sizes | 
    gtftk select_by_key -t| gtftk tabulate -x -k gene_id,transcript_id,feat_size,exon_sizes | 
    sort -nrk3,3

    ENSG00000096093 ENST00000481466 387 25,211,151
    ENSG00000072778 ENST00000582450 294 113,181
    ENSG00000049759 ENST00000591579 144 92,52

This is what I get regarding the shortest with short_long

     gtftk get_example -d "mini_real" | 
     gtftk select_by_key -k gene_id -v ENSG00000049759,ENSG00000072778,ENSG00000096093 | gtftk short_long  | 
     gtftk feature_size -t mature_rna | 
     gtftk select_by_key -t| gtftk tabulate -x -k gene_id,transcript_id,feat_size
dputhier commented 3 years ago

I think what you want is different. You are interested at the genomic size if I understand well.

mickey-spongebob commented 3 years ago

Hmm I see...

I am interested in the transcript size but perhaps my emphasis was looking at the genomic region. Nevertheless, I've looked at the other transcripts and it seems that it is consistently selecting the longest transcripts. So maybe this is an example instance where the genomic coordinates are misleading for transcript size (i.e. splicing etc.)

I'm sorry for bothering you and thank you for all your help!! :-)

I'll close the issue if you're ok with it :-)