dputhier / libgtftk

gtftk C Library and program
GNU General Public License v3.0
3 stars 2 forks source link

select_trancript (select_most_5p_tx) is not returning the most 5' tx #80

Closed dputhier closed 6 years ago

dputhier commented 6 years ago

Hi, Working a little bit on improving tests I realized that select_trancript with argument 3 is not selecting transcript with the most 5' TSS.

Without call to select_most_5p_tx. Selecting gene ISG15 we obtain 3 transcripts.

   gtftk get_example -d mini_real  | gtftk select_by_key -k gene_name -v ISG15 | gtftk select_by_key --select-transcripts | gtftk 5p_3p_coord
   chr1 1001137 1001138 ENSG00000187608|ENST00000624697 0   +
   chr1 1001144 1001145 ENSG00000187608|ENST00000624652 0   +
   chr1 1013422 1013423 ENSG00000187608|ENST00000379389 0   +

Now we also ask for the most 5':

   gtftk get_example -d mini_real  | gtftk select_most_5p_tx |  gtftk select_by_key -k gene_name -v ISG15 | gtftk select_by_key --select-transcripts | gtftk 5p_3p_coor       
   chr1 1013422 1013423 ENSG00000187608|ENST00000379389 0   +

Unfortunately, on forward strand, we expect ENST00000624697.

fafa13 commented 6 years ago

I pushed a fix for this issue. Should work now.

dputhier commented 6 years ago

OK. Last tests were OK. There are still some issues. For instance, using RAB13:

    $ gtftk get_example -d mini_real  | gtftk select_by_key -k gene_name -v RAB13 | gtftk select_by_key --select-transcripts | gtftk 5p_3p_coord  -t transcript | bedtools sort

    chr1    153985361   153985362   ENSG00000143545|ENST00000484297 0   -
    chr1    153985365   153985366   ENSG00000143545|ENST00000614713 0   -
    chr1    153986245   153986246   ENSG00000143545|ENST00000462680 0   -
    chr1    153986351   153986352   ENSG00000143545|ENST00000368575 0   -
    chr1    153986357   153986358   ENSG00000143545|ENST00000495720 0   -

Unfortunatly, calling select_most_5p_tx, we don't get the most 5'.

    $ gtftk get_example -d mini_real  | gtftk select_most_5p_tx | gtftk select_by_key -k gene_name -v RAB13 | gtftk select_by_key --select-transcripts | gtftk 5p_3p_coord  -t transcript  | bedtools sort
    chr1    153985365   153985366   ENSG00000143545|ENST00000614713 0   -
dputhier commented 6 years ago

The same problem is encountered on CRABP2.

dputhier commented 6 years ago

Maybe it is not strand specific ?

dputhier commented 6 years ago

Also, what is the rule in case of Ties ?

dputhier commented 6 years ago

Hi Fa, Did you have any time to have a look to this one ? Best

fafa13 commented 6 years ago

Hello, I'm back to GTFTK. I'll try to look at that this afternoon.

fafa13 commented 6 years ago

I suppressed select_transcript2 in the last push. It works now on my conda environment. Can you try it ?

dputhier commented 6 years ago

I have to replace calls to select_transcript2 by calls to select_transcript ?

fafa13 commented 6 years ago

Yes !

dputhier commented 6 years ago

I tried to include the last libgtftk version but I got the following message:

  gtftk get_example -d mini_real  | gtftk feature_size -t mature_rna

Exception AttributeError: AttributeError("function/symbol 'free_gtf_data' not found in library '/usr/local/lib/libgtftk.so': dlsym(0x7fddfd115170, free_gtf_data): symbol not found",) in <object repr() failed> ignored
Traceback (most recent call last):
  File "/Users/puthier/miniconda2/envs/pygtftk/bin/gtftk", line 4, in <module>
    __import__('pkg_resources').run_script('pygtftk==0.8.0.dev0+9422', 'gtftk')
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pkg_resources/__init__.py", line 657, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1437, in run_script
    exec(code, namespace, namespace)
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/EGG-INFO/scripts/gtftk", line 77, in <module>
    args = main()
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/EGG-INFO/scripts/gtftk", line 63, in main
    CmdManager.run(args)
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/pygtftk/cmd_manager.py", line 897, in run
    fun(**args)
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/pygtftk/plugins/feat_size.py", line 99, in feature_size
    gtf = GTF(inputfile)
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/pygtftk/gtf_interface.py", line 442, in __init__
    for i in tab:
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/pygtftk/gtf_interface.py", line 1096, in extract_data_iter_list
    ptr = self._dll.extract_data(self._data, keys_csv, base, nr)
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/cffi/api.py", line 882, in __getattr__
    make_accessor(name)
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/cffi/api.py", line 878, in make_accessor
    accessors[name](name)
  File "/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/cffi/api.py", line 808, in accessor_function
    value = backendlib.load_function(BType, name)
AttributeError: function/symbol 'extract_data' not found in library '/usr/local/lib/libgtftk.so': dlsym(0x7fb78693d350, extract_data): symbol not found
Exception AttributeError: AttributeError("function/symbol 'free_gtf_data' not found in library '/usr/local/lib/libgtftk.so': dlsym(0x7fb78693d350, free_gtf_data): symbol not found",) in <bound method GTF.__del__ of 
- A GTF object containing 0 lines:
- File connection: -
dputhier commented 6 years ago

I deleted this library (/usr/local/lib/libgtftk.so) which was, I guess, a relic from an old installation (I'm under a conda env). Maybe it was just indicating that something went wrong upon compilation. Now the message is:

    gtftk -h
        |--- 17:26-WARNING : Failed to load plugin :/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/pygtftk/plugins/5p_3p_coords.py
    cannot load library '/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/pygtftk/lib/libgtftk.so': dlopen(/Users/puthier/miniconda2/envs/pygtftk/lib/python2.7/site-packages/pygtftk-0.8.0.dev0+9422-py2.7-macosx-10.9-x86_64.egg/pygtftk/lib/libgtftk.so, 2): Symbol not found: _get_blast_header

This seem to be related to this new function related to blast parsing.

fafa13 commented 6 years ago

I don't have this problem on blackflag in a conda environment.

dputhier commented 6 years ago

Can you clone a new repo and test ?

fafa13 commented 6 years ago

Works fine. But when I run "gtftk -h", I have this message :

Python is not installed as a framework. The Mac OS X backend will not be able to function correctly if Python is not installed as a framework. See the Python documentation for more information on installing Python as a framework on Mac OS X. Please either reinstall Python as a framework, or try one of the other backends. If you are using (Ana)Conda please install python.app and replace the use of 'python' with 'pythonw'. See 'Working with Matplotlib on OSX' in the Matplotlib FAQ for more information.

I just have to reinstall matplotlib (conda install matplotlib) to remove the message.

dputhier commented 6 years ago

I will have a look for matplotlib (now required as R dependencies have been replaced by Python code). Concerning "select_transcript" update, can you go to "feature_libgtftk_h80" branch. It contains the last update of libgtftk. Best

fafa13 commented 6 years ago

I just did that and it still works fine ! "gtftk get_example -d mini_real | gtftk feature_size -t mature_rna" runs like a charm !

dputhier commented 6 years ago

Hum... I clone it too and it seems ~ OK. The problem is that several test are not successfuly executed. I have to check.

dputhier commented 6 years ago

Ah, OK. Now I get some additional lines with "?". Nice. I will have to deal with it.

fafa13 commented 6 years ago

Yes a lot of test fail because of "?". Especially, you have to update "tabulate".

The command gtftk get_example -d mini_real | gtftk select_most_5p_tx | gtftk select_by_key -k gene_name -v RAB13 | gtftk select_by_key --select-transcripts | gtftk 5p_3p_coord -t transcript | bedtools sort

you put at the beginning of this issue doesn't work because "gtftk get_example -d mini_real" give a GTF without gene lines. "select_most_5p_tx" function needs gene lines to perform. So if you add a "gtftk convert_ensembl" before calling "gtftk select_most_5p_tx", that works fine.

dputhier commented 6 years ago

This one is fixed in cd2bda89901d8d7de926148c6df8260bc7640393