drmjc / lumidat

An R package for processing Illumina gene expression idat files

Too many IDATs #2

Open AndrewSkelton opened 9 years ago

AndrewSkelton commented 9 years ago

Hi, I tried out lumidat on a large number of IDAT files (~200) and I get truncation errors. I believe this might be because the resulting java call will be extremely long.

Maybe worth doing one java call per array (as in 12 arrays on a HT-12), make the sample probe profile and control probe profile, then combine them in some way at the end? Or do all files need to be loaded in at once?

drmjc commented 9 years ago

Hi Andrew, Thanks for the feedback. I think you might be right there. I know I've run about 100 through though, so you should be able to get a bit more than 27 out! Are you able to make the paths to your files shorter, eg by changing your working dir to something closer to the files?

It might be a while before I can look into this, so please report back if you're able to break your job into smaller batches and then merge the expressiondata objects in R using cbind.

Cheers, Mark

On Sat, 29 Nov 2014 at 2:18 am, Andrew Skelton wrote:

  • I get ~27 samples in the resultant lumibatch object

AndrewSkelton commented 9 years ago

Hi Mark,

I did an isolated run of the compiled jar with everything in the same directory (thus minimising the path lengths). That ran fine (although I had to give java ~8GB, but that's by the by), with 2 warnings:

WARNING, retained 4188 probes with low numbers of beads. These may cause havoc in downstream analysis, like the lumi pipeline.

WARNING, retained 81 probes with low numbers of beads. These may cause havoc in downstream analysis, like the lumi pipeline.

Wrote: Sample Probe Profile.txt

Wrote: Control Probe Profile.txt

I thought it was a bit unusual that it threw two of those warnings. Thoughts?

With regards to combining multiple outputs, the sample probe profile shouldn't be an issue, as you can give lumiR more than one and it'll do all the combining. I'm not sure about how to go about combining the control probe profile files, but I'll take a look.

drmjc commented 9 years ago

Hi Andrew, thanks for the update, that's great.

That message is printed by the writeFinalOutput method, and is thus being produced during the creation of both the sample probes and control probes files.

If using the java implementation, there are two ways of handling large numbers of arrays: a single ZIP file containing all the IDATs, or sending the filenames via stdin:

{code}
$ java -jar lumidat-1.2.2.jar
Welcome to lumidat, version 1.2.2.
Mark Cowley, Garvan Institute of Medical Research (2013).

ERROR: no input files identified.

usage: java [-Xmx1024m] -jar lumidat-1.2.2.jar -inputfile idat.zip
       java -jar lumidat-1.2.2.jar file1.idat [file2.idat ...]
       cat idat.files.txt | java -jar lumidat-1.2.2.jar lumidat -
Options are:
 -allProbes      Include all probes, even if they have very low numbers of beads
 -bg             Determines whether or not to perform background subtraction.
                 Valid values are 'true' and 'false'.
                 The default behavior is not to perform background subtraction.
                 NOTE: this is an experimental feature.
 -clmfile        CLM file associating IDAT file names to sample names
                 Expects a path to the file; the option is ignored if no path is provided
 -collapse       Collapse Probes to one value for each gene using the given mode.
                 Valid values are 'max', 'median', 'mean' and 'none'.
                 The default behavior is 'none' (do not collapse).
 -h              Print help for this application.
 -inputfile      ZIP file containing the IDAT files.
                 Alternatively, you can just pass the paths to at least 1 idat file(s)
                 as the final argument(s).
                 Finally, you can tell lumidat to read files from STDIN,
                 by specifying a single argument of '-'
 -manifestfile   The text version of the manifest, NOT the BGX version.
                 This must match the array used to generate the IDATs.
 -manifesturl    The URL to the text version of the manifest.
                 This must match the array used to generate the IDATs.
 -outputDir      The [optional] directory to put the output files. Default is '.'
 -prefix         The [optional] prefix to use for creation of the signal values output file.
 -probeID        Control whether the ProbeID column will contain Illumina ProbeID,
                 ArrayAddressID, NuID or Sequence (ie Probe_Sequence).
                 Valid values are 'ProbeID', 'ArrayAddressID', 'NuID', 'Sequence'
                 GenomeStudio usually writes ArrayAddressID, whereas ExpressionFileCreator writes ProbeID
                 The default behaviour is 'ArrayAddressID'.
 -quiet          Suppress printing messages
{code}

and if using the R implementation, the ‘zip.file’ option is available in lumiR.idat or read.ilmn.
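As a rough illustration of the stdin route with the standalone jar, the sketch below collects IDAT paths with `find` and pipes them to a single java call, so the command line stays short no matter how many files there are. It assumes the single `-` argument described in the option text above is what switches on stdin reading (the usage line itself prints `lumidat -`, so check this against your copy of the jar). The function names `list_idats` and `run_from_stdin` and the `8g` heap value are made up for the example.

```shell
#!/bin/sh
# Sketch only: feed IDAT paths to the lumidat jar on stdin rather than as arguments.
# Jar name and -Xmx come from the usage text in this thread; the function names
# are hypothetical, and the lone '-' argument is assumed to mean "read paths from stdin".

JAR=${JAR:-lumidat-1.2.2.jar}

# list_idats DIR
#   Print every *.idat path found under DIR, one per line, in sorted order.
list_idats() {
    find "$1" -type f -name '*.idat' | sort
}

# run_from_stdin DIR [HEAP]
#   Pipe the IDAT paths under DIR into one java call.
#   HEAP is the value appended to -Xmx (defaults to 1024m, as in the usage text).
run_from_stdin() {
    dir=$1
    heap=${2:-1024m}
    list_idats "$dir" | java "-Xmx$heap" -jar "$JAR" -
}

# Example call (not run here):
#   run_from_stdin /path/to/idats 8g
```

Because the paths travel over a pipe, the length of the working directory or the number of arrays no longer affects the size of the command line.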

A generic solution to this would be to refactor the R interface to send idat paths via stdin to the underlying java process. Given there’s a reasonable workaround already, this is going to be fairly low on my priority list. I’d be very happy to merge a pull request from you if you’re able to do this.

Finally, if you did have to combine batches, I’d just run lumiR.idat on each batch and then cbind the objects.
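For anyone driving the standalone jar rather than R, the same batching idea might look roughly like the sketch below: split a list of IDAT paths into fixed-size chunks and run the jar once per chunk, each with its own output directory. Only the jar name and the `-outputDir` option are taken from the usage text above; the function name `run_batches`, the list file, and the directory layout are hypothetical, and a batch size of 12 only corresponds to one HT-12 chip if the list is sorted by chip.

```shell
#!/bin/sh
# Sketch only: run the lumidat jar once per fixed-size batch of IDAT files.
# The jar name and -outputDir come from the usage text in this thread;
# the function name, list file and directory layout are hypothetical.

JAR=${JAR:-lumidat-1.2.2.jar}

# run_batches LIST SIZE WORKDIR
#   LIST     text file with one IDAT path per line
#   SIZE     number of paths per batch
#   WORKDIR  directory that receives the batch lists and the per-batch output
run_batches() {
    list=$1
    size=$2
    work=$3
    mkdir -p "$work" || return 1
    # split writes $work/batch.aa, $work/batch.ab, ... with at most $size lines each
    split -l "$size" "$list" "$work/batch." || return 1
    for b in "$work"/batch.*; do
        out="$work/out.${b##*/}"
        mkdir -p "$out" || return 1
        # Rebuild the positional parameters from the batch file so that
        # paths containing spaces are passed to java intact.
        set --
        while IFS= read -r f; do
            set -- "$@" "$f"
        done < "$b"
        java -jar "$JAR" -outputDir "$out" "$@" || return 1
    done
}

# Example call (not run here):
#   run_batches idat.files.txt 12 batches
```

Each `out.batch.*` directory should then hold its own pair of profile files, which still need to be combined afterwards as discussed earlier in the thread.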

cheers, Mark

AndrewSkelton commented 9 years ago

Is the source for that jar file available anywhere?