galaxyproject / tools-iuc

Tool Shed repositories maintained by the Intergalactic Utilities Commission
https://galaxyproject.org/iuc
MIT License
160 stars 417 forks source link

Enhancements for multiGPS #1226

Open nekrut opened 7 years ago

nekrut commented 7 years ago

MultiGPS is now on main. It is cutting edge binding site finder that analyzes multiple samples at once and performs differential binding analysis and motif search in one go. It can be used for analysis of ChIP-exo data out of the box. It is developed by @shaunmahony and wrapped by @gregvonkuster.

Before it is fully operational we need to make the following enhancements:

nekrut commented 7 years ago

There are two ways to provide experiment design details to multiGPS:

Perhaps the easiest way is to gather all input into and then write design file within <command> tag.

Here is an example of design file:

#DataFile   Signal/Control  FileFormat  ConditionName   ReplicateName
Cdx2-ES-rep1.bam    signal  BAM ES  1
Cdx2-ES-rep2.bam    signal  BAM ES  2
Cdx2-ES-rep3.bam    signal  BAM ES  3
Input-ES-rep1.bam   control BAM ES
Cdx2-pMN-rep1.bam   signal  BAM pMN 1
Cdx2-pMN-rep2.bam   signal  BAM pMN 2
Input-pMN-rep1.bam  control BAM pMN 1
Input-pMN-rep2.bam  control BAM pMN 2
gregvonkuster commented 7 years ago

I’ve got the first pass of this enhanced tool in the TTS - https://testtoolshed.g2.bx.psu.edu/view/iuc/multigps/91127c200437.

I’ve done only a very small sample of testing, so please let me know if you find problems

I think the new bed file output might be a problem - it currently contains this string:

track name=multiGPS-Cond1 description=multiGPS events Cond1

Please let me know what you would like changed. This tool is at the top of my priority list until Anton approves.

bgruening commented 7 years ago

@gregvonkuster according to https://genome.ucsc.edu/FAQ/FAQformat.html

Please note that only in custom tracks can the first lines of the file consist of header lines, which begin with the word "browser" or "track" to assist the browser in the display and interpretation of the lines of BED data following the headers. Such annotation track header lines are not permissible in downstream utilities such as bedToBigBed, which convert lines of BED text to indexed binary files.

these are track lines and can be removed I think. Do you have a PR to look at?

gregvonkuster commented 7 years ago

PR is here https://github.com/galaxyproject/tools-iuc/pull/1227

gregvonkuster commented 7 years ago

@bgruening My assumption is that the BED output might need to be corrected in the .jar file, but I'm not quite sure...

nekrut commented 7 years ago

So it looks like this for two replicates on the same experiment (2 treatments and 2 inputs):

multigps

I think experiment type should not be input specific, no (ping @shaunmahony)?

nekrut commented 7 years ago

I think that conditions should be the unit of repeat (not Input file). So when you start the tool you have one condition open with two input files: signal and control. You should be able to add replicates to this condition. If you want a completely new condition, you should be able to click on "insert new condition". @gregvonkuster - can you stop by ~ 4pm today so we draw this one the board.

nekrut commented 7 years ago

e.g., look at DeSeq2 interface on main

shaunmahony commented 7 years ago

@gregvonkuster @bgruening: It's an easy change to remove the header line from the output BED. I thought it would be nice to include a label/description, since multiple BED files are produced by the tool, but I'm happy to remove it if it causes issues downstream

gregvonkuster commented 7 years ago

@shaunmahony I may have just been seeing an empty bed file (except for the header) due to my smaller input datasets I was using. Its fine with me to leave the header.

gregvonkuster commented 7 years ago

@nekrut @shaunmahony I can change the UI to have the repeat based on condition rather than input file. I am not sure how to deal with the optional control file for multiple inputs when using a design file. If each of the repeat blocks includes a control file, can I generate a comma-separated list on the command line for the --ctrl option, something like this?

--ctrl file1,file2,file3
gregvonkuster commented 7 years ago

@nekrut @shaunmahony I've got the next pass of this tool on ares.bmb.psu.edu. The UI now has the <repeat> based on the Experiment type - Signal or Control. Each <repeat> block includes both an input (experiment) and an optional control input. I need direction on how to handle the multiple control files on the command line when using a design file - see the above comment.

I'm seeing some job errors in my testing but I'm not sure if they are due to my test datasets or something else.

nekrut commented 7 years ago

Now it does not make sense because "This experiment is" dropdown is now irrelevant.

nekrut commented 7 years ago

Here is a summary of our discussion with @gregvonkuster today:

Galaxy will process the tool form to always generate a control file. In the simplest possible case (one signal and one control) the control file will look like this:

signal_1_file  signal   BAM  1  1
control_file   control  BAM  1

when you have multiple signal replicates it will look like this:

signal_1_file  signal   BAM  1  1
signal_2_file  signal   BAM  1  2
control_file   control  BAM  1

when you have multiple conditions with multiple replicates it will look like this:

tal1_signal_1_file    signal   BAM  Tal1   1
tal1_signal_2_file    signal   BAM  Tal1   2
tal1_control_file     control  BAM  Tal1
gata1_signal_1_file   signal   BAM  Gata1  1
gata1_signal_2_file   signal   BAM  Gata1  2
gata1_control_file    control  BAM  Gata1

However, there are several questions for @shaunmahony:

img_20170404_111657

shaunmahony commented 7 years ago

@nekrut answers:

- when Experiment type for this replicate: chipseq/chipexo. is not chosen, what is the default? chipseq is the default

- Fixed per-base read count limit for this replicate. If this is set to be defined by user, the user should enter an integer, right? Yes; an integer representing the read count limit should be entered.

- can replicate number be optional for signal of there is only one replicate? Yes, replicate name/number is already optional. Note that there are two more complications to consider: 1) if you give more than one signal file (e.g. BAM) for a given condition and you give them all the same replicate name, multiGPS will merge them all into the same replicate. 2) if you give more than one signal file (e.g. BAM) for a given condition and you don't specify any replicate names for any of them, multiGPS will again merge them all into one replicate (with default replicate name "rep1"). You might wonder why I don't treat all signal files as being separate replicates by default - I was thinking here about sequencing replicates (i.e. re-sequencing the same material) where the correct handling is to merge data.

gregvonkuster commented 7 years ago

@nekrut and @shaunmahony, I believe I now have the UI pretty close to what we discussed. Tool execution is till broken, but if you get a chance to take a look at the UI on ares, please let me know if I’ve screwed something up. I’m still working on the tool execution. Thanks! ;)

nekrut commented 7 years ago

@gregvonkuster I think Experiment should be named Signal. Other than that it looks exactly as we've discussed

shaunmahony commented 7 years ago

@gregvonkuster I agree re "Signal" instead of "Experiment" The UI looks great! Is there a way to name the conditions or is it better to leave them as sequential numbers? Also, the description under "Set data scaling parameters" should be: "Default behavior is to scale signal to corresponding controls using the Normalization of ChIP-seq (NCIS) method described by Liang & Keles, BMC Bioinformatics 2012." I think the text you have came from an old version of the code.

nekrut commented 7 years ago

Agree with @shaunmahony : one should be able to name conditions

gregvonkuster commented 7 years ago

@nekrut @shaunmahony I believe I have incorporated all of your current feedback into the UI on ares. I still do not have the tool execution fully working, but I hope that I am close. I will need to consult with @blankenberg to work out the latest problem for the execution.

gregvonkuster commented 7 years ago

@nekrut @shaunmahony The tool now executes and produces "green" outputs, but the stdout states incorrect lines in design files. I'm looking into that now.

nekrut commented 7 years ago

@gregvonkuster let me know when to test

gregvonkuster commented 7 years ago

@nekrut Go ahead and test. The output datasets look right to me. I have messed around with the generation of the design file, but I have not been able to get the stdout message "Incorrectly formatted line in design file" eliminated. I believe it is due to empty lines being generated in the design file by the Galaxy <configfiles> tag set handler, but I'll have to look further to find the cause. From what I've seen, MultiGPS just outputs the warning, but processing is correct. Let me know what you discover.

gregvonkuster commented 7 years ago

The design file looks like this - notice the 2 blank lines at the start:


/home/galaxy-pugh/git_workspace/galaxy/database/files/001/dataset_1005.dat  Signal  SCIDX   cond    rep chipseq 
/home/galaxy-pugh/git_workspace/galaxy/database/files/001/dataset_1015.dat  Control SCIDX   cond        chipseq 
nekrut commented 7 years ago

I bit of time crunch - teaching today :(. Will test around 4pm

blankenberg commented 7 years ago

@gregvonkuster Move the <![CDATA[ and the first #for block directly next to the <configfile> open block. Tried to make a PR, but github keeps blocking me.:

         ]]>
     </command>
     <configfiles>
-        <configfile name="build_design_file">
-<![CDATA[
-#for $condition_items in $condition_repeat:
+        <configfile name="build_design_file"><![CDATA[#for $condition_items in $condition_repeat:
     #for $signal_items in $condition_items.signal_repeat:
         #if str($signal_items.fixedreadcount_cond.fixedreadcount_select) == 'yes':
             #set $frc = $signal_items.fixedreadcount_cond.fixedreadcount
gregvonkuster commented 7 years ago

Ah yes, @blankenberg , and at the end of the CDATA block as well. Thanks - that fixes it! ;)

nekrut commented 7 years ago

Right now you have Condition name within signal or control repeat. It should not be there, it should be one level up (at condition level). The same applies to Experiment type - it is a condition level variable.

I'm running a test now

gregvonkuster commented 7 years ago

@nekrut thanks - fixed and available in TTS. I can update ares as soon as the current multigps tool execution finishes unless its ok to just kill it. This Galaxy instance uses only the local job runner, so stopping/starting the server (required because of multiple web front-end processes) will kill the job.

nekrut commented 7 years ago

you can kill it

gregvonkuster commented 7 years ago

Ok, ares is updated.

nekrut commented 7 years ago

running. two suggestions:

gregvonkuster commented 7 years ago

I don't believe auto-incrementing the default value for replicate name is possible since the default value is static html. This would also be useful for the default for the condition name "cond".

I will add the enhancement to optionally output the design file.

nekrut commented 7 years ago

ok, then make default 1 instead of repl

nekrut commented 7 years ago

Hmmm..trying to run a job throws this:

The server could not complete the request. Please contact the Galaxy Team if this error persists. Error executing tool: (psycopg2.DataError) invalid input syntax for integer: "yes" LINE 3: WHERE history_dataset_association.id = 'yes' ^ [SQL: 'SELECT history_dataset_association.id AS history_dataset_association_id, history_dataset_association.history_id AS history_dataset_association_history_id, history_dataset_association.dataset_id AS history_dataset_association_dataset_id, history_dataset_association.create_time AS history_dataset_association_create_time, history_dataset_association.update_time AS history_dataset_association_update_time, history_dataset_association.state AS history_dataset_association_state, history_dataset_association.copied_from_history_dataset_association_id AS history_dataset_association_copied_from_history_dataset_a_1, history_dataset_association.copied_from_library_dataset_dataset_association_id AS history_dataset_association_copied_from_library_dataset_d_2, history_dataset_association.name AS history_dataset_association_name, history_dataset_association.info AS history_dataset_association_info, history_dataset_association.blurb AS history_dataset_association_blurb, history_dataset_association.peek AS history_dataset_association_peek, history_dataset_association.tool_version AS history_dataset_association_tool_version, history_dataset_association.extension AS history_dataset_association_extension, history_dataset_association.parent_id AS history_dataset_association_parent_id, history_dataset_association.designation AS history_dataset_association_designation, history_dataset_association.deleted AS history_dataset_association_deleted, history_dataset_association.visible AS history_dataset_association_visible, history_dataset_association.extended_metadata_id AS history_dataset_association_extended_metadata_id, history_dataset_association.hid AS history_dataset_association_hid, history_dataset_association.purged AS history_dataset_association_purged, history_dataset_association.hidden_beneath_collection_instance_id AS history_dataset_association_hidden_beneath_collection_ins_3, dataset_1.id AS dataset_1_id, dataset_1.create_time AS dataset_1_create_time, dataset_1.update_time AS dataset_1_update_time, dataset_1.state AS dataset_1_state, dataset_1.deleted AS dataset_1_deleted, dataset_1.purged AS dataset_1_purged, dataset_1.purgable AS dataset_1_purgable, dataset_1.object_store_id AS dataset_1_object_store_id, dataset_1.external_filename AS dataset_1_external_filename, dataset_1._extra_files_path AS dataset_1__extra_files_path, dataset_1.file_size AS dataset_1_file_size, dataset_1.total_size AS dataset_1_total_size, dataset_1.uuid AS dataset_1_uuid \nFROM history_dataset_association LEFT OUTER JOIN dataset AS dataset_1 ON dataset_1.id = history_dataset_association.dataset_id \nWHERE history_dataset_association.id = %(param_1)s'] [parameters: {'param_1': u'yes'}]
gregvonkuster commented 7 years ago

Yes, this happens if you set "Output design file" to "yes". Leave it "no" for now until I figure this out and I think things will work.

nekrut commented 7 years ago

ok. running

gregvonkuster commented 7 years ago

@nekrut I noticed your job ended (hopefully successfully) on ares, so I updated Galaxy with the latest fixes. The optional save design file feature is now working. I also upgraded the local job runner configuration, so jobs should finish faster in case you want to do more testing. I will remove the WIP label on this PR and update the MultiGPS tool in the MTS as soon as you let me know to do so.