Open nekrut opened 7 years ago
There are two ways to provide experiment design details to multiGPS:
--exptCONDNAME-REPNAME
and --ctrlCONDNAME-REPNAME
command line options--design
filePerhaps the easiest way is to gather all input into and then write design file within <command>
tag.
Here is an example of design file:
#DataFile Signal/Control FileFormat ConditionName ReplicateName
Cdx2-ES-rep1.bam signal BAM ES 1
Cdx2-ES-rep2.bam signal BAM ES 2
Cdx2-ES-rep3.bam signal BAM ES 3
Input-ES-rep1.bam control BAM ES
Cdx2-pMN-rep1.bam signal BAM pMN 1
Cdx2-pMN-rep2.bam signal BAM pMN 2
Input-pMN-rep1.bam control BAM pMN 1
Input-pMN-rep2.bam control BAM pMN 2
I’ve got the first pass of this enhanced tool in the TTS - https://testtoolshed.g2.bx.psu.edu/view/iuc/multigps/91127c200437.
I’ve done only a very small sample of testing, so please let me know if you find problems
I think the new bed file output might be a problem - it currently contains this string:
track name=multiGPS-Cond1 description=multiGPS events Cond1
Please let me know what you would like changed. This tool is at the top of my priority list until Anton approves.
@gregvonkuster according to https://genome.ucsc.edu/FAQ/FAQformat.html
Please note that only in custom tracks can the first lines of the file consist of header lines, which begin with the word "browser" or "track" to assist the browser in the display and interpretation of the lines of BED data following the headers. Such annotation track header lines are not permissible in downstream utilities such as bedToBigBed, which convert lines of BED text to indexed binary files.
these are track lines and can be removed I think. Do you have a PR to look at?
@bgruening My assumption is that the BED output might need to be corrected in the .jar file, but I'm not quite sure...
So it looks like this for two replicates on the same experiment (2 treatments and 2 inputs):
I think experiment type should not be input specific, no (ping @shaunmahony)?
I think that conditions should be the unit of repeat (not Input file). So when you start the tool you have one condition open with two input files: signal and control. You should be able to add replicates to this condition. If you want a completely new condition, you should be able to click on "insert new condition". @gregvonkuster - can you stop by ~ 4pm today so we draw this one the board.
e.g., look at DeSeq2 interface on main
@gregvonkuster @bgruening: It's an easy change to remove the header line from the output BED. I thought it would be nice to include a label/description, since multiple BED files are produced by the tool, but I'm happy to remove it if it causes issues downstream
@shaunmahony I may have just been seeing an empty bed file (except for the header) due to my smaller input datasets I was using. Its fine with me to leave the header.
@nekrut @shaunmahony I can change the UI to have the repeat based on condition rather than input file. I am not sure how to deal with the optional control file for multiple inputs when using a design file. If each of the repeat blocks includes a control file, can I generate a comma-separated list on the command line for the --ctrl option, something like this?
--ctrl file1,file2,file3
@nekrut @shaunmahony I've got the next pass of this tool on ares.bmb.psu.edu. The UI now has the <repeat>
based on the Experiment type - Signal or Control. Each <repeat>
block includes both an input (experiment) and an optional control input. I need direction on how to handle the multiple control files on the command line when using a design file - see the above comment.
I'm seeing some job errors in my testing but I'm not sure if they are due to my test datasets or something else.
Now it does not make sense because "This experiment is" dropdown is now irrelevant.
Here is a summary of our discussion with @gregvonkuster today:
Galaxy will process the tool form to always generate a control file. In the simplest possible case (one signal and one control) the control file will look like this:
signal_1_file signal BAM 1 1
control_file control BAM 1
when you have multiple signal replicates it will look like this:
signal_1_file signal BAM 1 1
signal_2_file signal BAM 1 2
control_file control BAM 1
when you have multiple conditions with multiple replicates it will look like this:
tal1_signal_1_file signal BAM Tal1 1
tal1_signal_2_file signal BAM Tal1 2
tal1_control_file control BAM Tal1
gata1_signal_1_file signal BAM Gata1 1
gata1_signal_2_file signal BAM Gata1 2
gata1_control_file control BAM Gata1
However, there are several questions for @shaunmahony:
Experiment type for this replicate: chipseq/chipexo.
is not chosen, what is the default?Fixed per-base read count limit for this replicate.
If this is set to be defined by user, the user should enter an integer, right?@nekrut answers:
- when Experiment type for this replicate: chipseq/chipexo. is not chosen, what is the default? chipseq is the default
- Fixed per-base read count limit for this replicate. If this is set to be defined by user, the user should enter an integer, right? Yes; an integer representing the read count limit should be entered.
- can replicate number be optional for signal of there is only one replicate? Yes, replicate name/number is already optional. Note that there are two more complications to consider: 1) if you give more than one signal file (e.g. BAM) for a given condition and you give them all the same replicate name, multiGPS will merge them all into the same replicate. 2) if you give more than one signal file (e.g. BAM) for a given condition and you don't specify any replicate names for any of them, multiGPS will again merge them all into one replicate (with default replicate name "rep1"). You might wonder why I don't treat all signal files as being separate replicates by default - I was thinking here about sequencing replicates (i.e. re-sequencing the same material) where the correct handling is to merge data.
@nekrut and @shaunmahony, I believe I now have the UI pretty close to what we discussed. Tool execution is till broken, but if you get a chance to take a look at the UI on ares, please let me know if I’ve screwed something up. I’m still working on the tool execution. Thanks! ;)
@gregvonkuster I think Experiment should be named Signal. Other than that it looks exactly as we've discussed
@gregvonkuster I agree re "Signal" instead of "Experiment" The UI looks great! Is there a way to name the conditions or is it better to leave them as sequential numbers? Also, the description under "Set data scaling parameters" should be: "Default behavior is to scale signal to corresponding controls using the Normalization of ChIP-seq (NCIS) method described by Liang & Keles, BMC Bioinformatics 2012." I think the text you have came from an old version of the code.
Agree with @shaunmahony : one should be able to name conditions
@nekrut @shaunmahony I believe I have incorporated all of your current feedback into the UI on ares. I still do not have the tool execution fully working, but I hope that I am close. I will need to consult with @blankenberg to work out the latest problem for the execution.
@nekrut @shaunmahony The tool now executes and produces "green" outputs, but the stdout states incorrect lines in design files. I'm looking into that now.
@gregvonkuster let me know when to test
@nekrut Go ahead and test. The output datasets look right to me. I have messed around with the generation of the design file, but I have not been able to get the stdout message "Incorrectly formatted line in design file" eliminated. I believe it is due to empty lines being generated in the design file by the Galaxy <configfiles>
tag set handler, but I'll have to look further to find the cause. From what I've seen, MultiGPS just outputs the warning, but processing is correct. Let me know what you discover.
The design file looks like this - notice the 2 blank lines at the start:
/home/galaxy-pugh/git_workspace/galaxy/database/files/001/dataset_1005.dat Signal SCIDX cond rep chipseq
/home/galaxy-pugh/git_workspace/galaxy/database/files/001/dataset_1015.dat Control SCIDX cond chipseq
I bit of time crunch - teaching today :(. Will test around 4pm
@gregvonkuster Move the <![CDATA[
and the first #for
block directly next to the <configfile>
open block. Tried to make a PR, but github keeps blocking me.:
]]>
</command>
<configfiles>
- <configfile name="build_design_file">
-<![CDATA[
-#for $condition_items in $condition_repeat:
+ <configfile name="build_design_file"><![CDATA[#for $condition_items in $condition_repeat:
#for $signal_items in $condition_items.signal_repeat:
#if str($signal_items.fixedreadcount_cond.fixedreadcount_select) == 'yes':
#set $frc = $signal_items.fixedreadcount_cond.fixedreadcount
Ah yes, @blankenberg , and at the end of the CDATA block as well. Thanks - that fixes it! ;)
Right now you have Condition name
within signal or control repeat. It should not be there, it should be one level up (at condition level). The same applies to Experiment type
- it is a condition level variable.
I'm running a test now
@nekrut thanks - fixed and available in TTS. I can update ares as soon as the current multigps tool execution finishes unless its ok to just kill it. This Galaxy instance uses only the local job runner, so stopping/starting the server (required because of multiple web front-end processes) will kill the job.
you can kill it
Ok, ares is updated.
running. two suggestions:
1
in there and if you add another signal it opens with number 2
, and so onI don't believe auto-incrementing the default value for replicate name is possible since the default value is static html. This would also be useful for the default for the condition name "cond".
I will add the enhancement to optionally output the design file.
ok, then make default 1
instead of repl
Hmmm..trying to run a job throws this:
The server could not complete the request. Please contact the Galaxy Team if this error persists. Error executing tool: (psycopg2.DataError) invalid input syntax for integer: "yes" LINE 3: WHERE history_dataset_association.id = 'yes' ^ [SQL: 'SELECT history_dataset_association.id AS history_dataset_association_id, history_dataset_association.history_id AS history_dataset_association_history_id, history_dataset_association.dataset_id AS history_dataset_association_dataset_id, history_dataset_association.create_time AS history_dataset_association_create_time, history_dataset_association.update_time AS history_dataset_association_update_time, history_dataset_association.state AS history_dataset_association_state, history_dataset_association.copied_from_history_dataset_association_id AS history_dataset_association_copied_from_history_dataset_a_1, history_dataset_association.copied_from_library_dataset_dataset_association_id AS history_dataset_association_copied_from_library_dataset_d_2, history_dataset_association.name AS history_dataset_association_name, history_dataset_association.info AS history_dataset_association_info, history_dataset_association.blurb AS history_dataset_association_blurb, history_dataset_association.peek AS history_dataset_association_peek, history_dataset_association.tool_version AS history_dataset_association_tool_version, history_dataset_association.extension AS history_dataset_association_extension, history_dataset_association.parent_id AS history_dataset_association_parent_id, history_dataset_association.designation AS history_dataset_association_designation, history_dataset_association.deleted AS history_dataset_association_deleted, history_dataset_association.visible AS history_dataset_association_visible, history_dataset_association.extended_metadata_id AS history_dataset_association_extended_metadata_id, history_dataset_association.hid AS history_dataset_association_hid, history_dataset_association.purged AS history_dataset_association_purged, history_dataset_association.hidden_beneath_collection_instance_id AS history_dataset_association_hidden_beneath_collection_ins_3, dataset_1.id AS dataset_1_id, dataset_1.create_time AS dataset_1_create_time, dataset_1.update_time AS dataset_1_update_time, dataset_1.state AS dataset_1_state, dataset_1.deleted AS dataset_1_deleted, dataset_1.purged AS dataset_1_purged, dataset_1.purgable AS dataset_1_purgable, dataset_1.object_store_id AS dataset_1_object_store_id, dataset_1.external_filename AS dataset_1_external_filename, dataset_1._extra_files_path AS dataset_1__extra_files_path, dataset_1.file_size AS dataset_1_file_size, dataset_1.total_size AS dataset_1_total_size, dataset_1.uuid AS dataset_1_uuid \nFROM history_dataset_association LEFT OUTER JOIN dataset AS dataset_1 ON dataset_1.id = history_dataset_association.dataset_id \nWHERE history_dataset_association.id = %(param_1)s'] [parameters: {'param_1': u'yes'}]
Yes, this happens if you set "Output design file" to "yes". Leave it "no" for now until I figure this out and I think things will work.
ok. running
@nekrut I noticed your job ended (hopefully successfully) on ares, so I updated Galaxy with the latest fixes. The optional save design file feature is now working. I also upgraded the local job runner configuration, so jobs should finish faster in case you want to do more testing. I will remove the WIP label on this PR and update the MultiGPS tool in the MTS as soon as you let me know to do so.
MultiGPS is now on main. It is cutting edge binding site finder that analyzes multiple samples at once and performs differential binding analysis and motif search in one go. It can be used for analysis of ChIP-exo data out of the box. It is developed by @shaunmahony and wrapped by @gregvonkuster.
Before it is fully operational we need to make the following enhancements: