galaxyproject / idc

Simon's Data Club - Reference data for Galaxy servers
MIT License

Complexity of STAR indices #27

Open lldelisle opened 1 year ago

lldelisle commented 1 year ago

Hi there, the tools rna_star and rna_starsolo use indices produced by the data manager rna_star_index_builder_data_manager. The latest version of STAR is currently 2.7.10b, and the earliest version of STAR that produces compatible indices is 2.7.4a, released in 2018: https://github.com/alexdobin/STAR/blob/56c9fd59d0cbddd630f9ed2a656dd3a963a1b6b4/source/parametersDefault#L1-L3

This corresponds to rna_star_index_builder_data_manager revision 9:c520a52b5174 version 2.7.4a and 11:d63c1442407f version 2.7.4a+galaxy1.

Currently, in data_managers.yml we have: https://github.com/galaxyproject/idc/blob/4aee221c69daebd09e602159d7a6a2911d339d1f/data_managers.yml#L34

which generated a table called rnastar_index2.loc (the new version of the table is rnastar_index2x_versioned.loc):

$ head /data/galaxy/galaxy/var/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/data_manager_star_index_builder/6ef6520f14fc/data_manager_star_index_builder/data_manager/rna_star_index_builder.xml 
<tool id="rna_star_index_builder_data_manager" name="rnastar index2" tool_type="manage_data" version="0.0.5" profile="17.01">
    <description>builder</description>

    <macros>
        <import>macros.xml</import>
    </macros>

    <expand macro="requirements" />

    <command><![CDATA[
$ cat /data/galaxy/galaxy/var/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/data_manager_star_index_builder/6ef6520f14fc/data_manager_star_index_builder/tool-data/rnastar_index2.loc.sample 
#This is a sample file distributed with Galaxy that enables tools
#to use a directory of rna-star indexed sequences data files. You will
#need to create these data files and then create a rnastar_index2.loc
#file similar to this one (store it in this directory) that points to
#the directories in which those files are stored. The rnastar_index2.loc
#file has this format (longer white space characters are TAB characters):
#
#<unique_build_id>   <dbkey>   <display_name>   <file_base_path>        <with-gtf>
#
#The <with-gtf> column should be 1 or 0, indicating whether the index was made
#with an annotation (i.e., --sjdbGTFfile and --sjdbOverhang were used) or not,
#respecively.
#
#Note that STAR indices can become quite large. Consequently, it is only
#advisable to create indices with annotations if it's known ahead of time that
#(A) the annotations won't be frequently updated and (B) the read lengths used
#will also rarely vary. If either of these is not the case, it's advisable to
#create indices without annotations and then specify an annotation file and
#maximum read length (minus 1) when running STAR.
#
#hg19   hg19    hg19 full   /mnt/galaxyIndices/genomes/hg19/rnastar     0
#hg19Ensembl   hg19Ensembl    hg19 full with Ensembl annotation   /mnt/galaxyIndices/genomes/hg19Ensembl/rnastar        1

$ cat /data/galaxy/galaxy/var/shed_tools/toolshed.g2.bx.psu.edu/repos/iuc/data_manager_star_index_builder/6ef6520f14fc/data_manager_star_index_builder/tool_data_table_conf.xml.sample 
<tables>
    <!-- Locations of all fasta files under genome directory -->
    <table name="all_fasta" comment_char="#" allow_duplicate_entries="False">
        <columns>value, dbkey, name, path</columns>
        <file path="tool-data/all_fasta.loc" />
    </table>
    <!-- Locations of indexes in the BWA mapper format -->
    <table name="rnastar_index2" comment_char="#" allow_duplicate_entries="False">
        <columns>value, dbkey, name, path, with-gtf</columns>
        <file path="tool-data/rnastar_index2.loc" />
    </table>
</tables>

However, when I check in the cvmfs cache, I have a table named rnastar_index2x_versioned.loc and the version inside is 2.7.4a.

:upside_down_face:

After this big introduction, I have 2 questions:

Thank you so much.

wm75 commented 1 year ago

The last column in the rnastar_index2x_versioned data table is the minimum version of RNAStar required to work with that index, and the RNAStar tool wrapper uses that info to offer only compatible index versions via this macro part:

https://github.com/galaxyproject/tools-iuc/blob/f93f921e3c2d0002ff0c152d90b9221533ad22e9/tools/rgrnastar/macros.xml#L1-L19

So 2.7.4a is still up to date (the RNAStar developer was changing the index structure around the switch from 2.6 to 2.7 but, as far as I know, has left it untouched since).
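
To illustrate the filtering idea, here is a minimal Python sketch that keeps only the rows of a versioned .loc table whose last column matches the index version a wrapper expects. The column order, the sample rows and the function name `compatible_indices` are assumptions made for illustration; the actual filtering is done by the Galaxy macro linked above, not by this code.

```python
# Assumed column layout of rnastar_index2x_versioned.loc (not confirmed in this thread).
COLUMNS = ["value", "dbkey", "name", "path", "with_gtf", "version"]


def compatible_indices(loc_text, index_version):
    """Return the table rows whose version column equals index_version."""
    rows = []
    for line in loc_text.splitlines():
        # Skip blank lines and comment lines, as .loc files use '#' for comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Ignore malformed rows that do not have the assumed number of columns.
        if len(fields) != len(COLUMNS):
            continue
        row = dict(zip(COLUMNS, fields))
        if row["version"] == index_version:
            rows.append(row)
    return rows


# Hypothetical table content, only for demonstration.
SAMPLE = (
    "#value\tdbkey\tname\tpath\twith_gtf\tversion\n"
    "hg19\thg19\thg19 full\t/mnt/galaxyIndices/genomes/hg19/rnastar\t0\t2.7.4a\n"
    "hg19_old\thg19\thg19 full (old)\t/mnt/galaxyIndices/genomes/hg19/rnastar_old\t0\t2.7.1a\n"
)

for row in compatible_indices(SAMPLE, "2.7.4a"):
    print(row["value"], row["path"])
```

With the sample rows above, only the entry tagged 2.7.4a is offered; the older entry is filtered out.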

lldelisle commented 1 year ago

For me this is unclear. If you want to use STAR version 2.7.1 you can use indices from 2.7.1 but not 2.7.4a, right? If you want to use STAR version 2.7.10 you can use indices from 2.7.4a AND 2.7.1? (In Galaxy this is not possible, we restrict to the exact same version: https://github.com/galaxyproject/tools-iuc/blob/f93f921e3c2d0002ff0c152d90b9221533ad22e9/tools/rgrnastar/macros.xml#L46)

wm75 commented 1 year ago

If you want to use STAR version 2.7.1 you can use indices from 2.7.1 but not 2.7.4a, right?

Yes, that's correct.

If you want to use STAR version 2.7.10 you can use indices from 2.7.4a AND 2.7.1?

No, you cannot. There is no backwards compatibility in later versions so you can only use 2.7.4a indices.
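
The rule discussed here can be sketched in a few lines of Python: a STAR release can use a 2.7.4a index only if it falls between 2.7.4a and 2.7.10b (the latest release mentioned in this thread). The version parser and the function names are hypothetical helpers, and the upper bound is only what this thread states, not a claim about later STAR releases.

```python
import re

# STAR versions look like "2.7.4a" or "2.7.10b": three numbers plus an optional letter.
_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)([a-z]?)$")


def parse_star_version(version):
    """Turn '2.7.10b' into a sortable tuple such as (2, 7, 10, 'b')."""
    match = _VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognised STAR version: {version!r}")
    major, minor, patch, letter = match.groups()
    return (int(major), int(minor), int(patch), letter)


# Bounds taken from this thread: 2.7.4a introduced the current index format,
# and 2.7.10b is the latest release known here to still use it.
OLDEST = parse_star_version("2.7.4a")
NEWEST = parse_star_version("2.7.10b")


def can_use_274a_index(star_version):
    """True if this STAR version is within the range known to read 2.7.4a indices."""
    return OLDEST <= parse_star_version(star_version) <= NEWEST


for version in ("2.7.1", "2.7.4a", "2.7.10b"):
    print(version, can_use_274a_index(version))
```

Numeric parsing matters here: comparing the raw strings would sort "2.7.10b" before "2.7.4a".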

wm75 commented 1 year ago

So the cvmfs content is up to date. Regarding your second question: I don't know if there is much of a use case for indices before 2.7.4a. These would be really old versions of RNAStar and people can still have the index built on the fly if they really want to use such an old version. Do you have a specific reason why you'd be interested in the older index format?

lldelisle commented 1 year ago

OK. Then we definitely need to use rna_star_index_builder_data_manager revision 9:c520a52b5174 version 2.7.4a or 11:d63c1442407f version 2.7.4a+galaxy1 instead of 0.0.5, and I still don't understand why you have up-to-date indices in cvmfs while the manager version in the yml is 0.0.5... @natefoo, did you use the CI to generate them?

lldelisle commented 1 year ago

Do you have a specific reason why you'd be interested in the older index format?

No, you are right.

wm75 commented 1 year ago

Currently, in data_managers.yml we have:

idc/data_managers.yml

tool_id: 'toolshed.g2.bx.psu.edu/repos/iuc/data_manager_star_index_builder/rna_star_index_builder_data_manager/0.0.5'

which generated a table called rnastar_index2.loc (the new version of the table is rnastar_index2x_versioned.loc)

Ah, had missed that part. Yes, this looks just outdated.

natefoo commented 1 year ago

The version we installed is actually 2.7.4a+galaxy1 - it looks like we ignore the version in the YAML and install the latest (which we shouldn't do and I will fix, but has worked out in our favor in this case).

The reason we installed the latest version is that we had to install the DMs through galaxyproject/usegalaxy-tools, and I generated the lock file with the latest versions there.
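
A mismatch like this one could be caught with a small check that compares the version pinned in the tool_id with the version actually installed. The following Python sketch assumes the version is always the last path component of the tool_id; `pinned_version` and `check_pin` are hypothetical helpers, not part of the idc or usegalaxy-tools tooling.

```python
def pinned_version(tool_id):
    """Return the last path component of a Tool Shed tool_id, assumed to be the version."""
    return tool_id.rsplit("/", 1)[-1]


def check_pin(tool_id, installed_version):
    """Return None if the pin matches the installed version, else a warning message."""
    pinned = pinned_version(tool_id)
    if pinned == installed_version:
        return None
    return f"YAML pins {pinned} but {installed_version} is installed"


# tool_id as quoted earlier in this thread from data_managers.yml.
TOOL_ID = (
    "toolshed.g2.bx.psu.edu/repos/iuc/data_manager_star_index_builder/"
    "rna_star_index_builder_data_manager/0.0.5"
)

message = check_pin(TOOL_ID, "2.7.4a+galaxy1")
if message is not None:
    print(message)
```

For the values in this issue the check reports that the YAML pins 0.0.5 while 2.7.4a+galaxy1 is installed.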