cgsb / hitscore

NYU CGSB Genomics Core Facility

transfers/backups of the PGM #43

Closed: agarwal closed this issue 10 years ago

agarwal commented 11 years ago

Understand the directory structure of the PGM, and create a script to transfer each run's data to Bowery.

smondet commented 11 years ago

Regarding issue 43:

Raw Data

The raw data seems to be in /results/sn25080361/ (sn25080361 is the serial number). As of Fri, 31 May 2013 11:16:52 -0400, it contains 854 GB.

There, each run has a directory R_<date>_<name-of-the-run> like R_2013_05_25_01_52_07_user_SN2-13-Justin_Monica_Seed1_2013-24-05.

So far most runs follow the pattern user_SN2-<nb>-<more-naming>, but not all: for example R_2013_05_09_10_26_21_user_ION_CONTROL_TEST_RUN, or R_2013_01_15_03_18_52_user_2013-01-14_318_Suphir (which should be SN2-3).
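To make the naming convention concrete, here is a minimal sketch (standard library only) that splits a run directory name into its timestamp and run name, and extracts <nb> when the name follows the user_SN2-<nb> pattern. The record type and function names are hypothetical, not part of the PGM software.

```ocaml
(* Hypothetical record describing one run directory name. *)
type run = {
  timestamp : string;  (* e.g. "2013_05_25_01_52_07" *)
  name : string;       (* everything after the timestamp *)
  sn2 : int option;    (* <nb> when the name starts with user_SN2-<nb> *)
}

(* True when [s] is non-empty and made only of decimal digits. *)
let is_digits s =
  let ok = ref (s <> "") in
  String.iter (fun c -> if c < '0' || c > '9' then ok := false) s;
  !ok

(* Integer formed by the leading decimal digits of [s], if any. *)
let leading_int s =
  let n = String.length s in
  let rec stop i =
    if i < n && s.[i] >= '0' && s.[i] <= '9' then stop (i + 1) else i in
  let i = stop 0 in
  if i = 0 then None else int_of_string_opt (String.sub s 0 i)

(* Parse "R_<yyyy>_<mm>_<dd>_<hh>_<mm>_<ss>_<name>"; None if it does not fit. *)
let parse_run_dir dir =
  match String.split_on_char '_' dir with
  | "R" :: y :: mo :: d :: h :: mi :: s :: (_ :: _ as rest)
    when List.for_all is_digits [y; mo; d; h; mi; s] ->
    let name = String.concat "_" rest in
    let prefix = "user_SN2-" in
    let pl = String.length prefix in
    let sn2 =
      if String.length name > pl && String.sub name 0 pl = prefix
      then leading_int (String.sub name pl (String.length name - pl))
      else None
    in
    Some { timestamp = String.concat "_" [y; mo; d; h; mi; s]; name; sn2 }
  | _ -> None
```

Runs that do not respect the pattern (like the ION_CONTROL_TEST_RUN one) still parse, but with sn2 = None, so they can be listed and handled by hand.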

The .dat files (which, according to the manual, are the raw data) are most often about 50 MB each, and there are a lot of them.

The run metadata seems to be in explog.txt (written at the beginning of the run) and explog_final.txt (at the end). Those are mostly Key : Value text files.
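A minimal sketch of reading such a file, assuming each useful line has the form "Key : Value" and that the key itself contains no colon. Lines without a colon are skipped, since the files are only mostly key/value. The function names are hypothetical.

```ocaml
(* Split a "Key : Value" line at the first ':' and trim both sides.
   Returns None for lines without ':' or with an empty key. *)
let parse_line line =
  match String.index_opt line ':' with
  | None -> None
  | Some i ->
    let key = String.trim (String.sub line 0 i) in
    let value =
      String.trim (String.sub line (i + 1) (String.length line - i - 1)) in
    if key = "" then None else Some (key, value)

(* Read a whole file (e.g. explog.txt) into an association list,
   in file order, ignoring the lines that do not parse. *)
let read_explog path =
  let ic = open_in path in
  let rec loop acc =
    match input_line ic with
    | line ->
      loop (match parse_line line with Some kv -> kv :: acc | None -> acc)
    | exception End_of_file -> close_in ic; List.rev acc
  in
  loop []
```

Only the first colon is used as the separator, so values that contain colons themselves (timestamps, for instance) are kept whole.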

Analyzed Data

Each run seems to also have a directory in /results/analysis/output/Home/ (315 GB as of Fri, 31 May 2013 11:28:47 -0400).

Depending on the configuration, FASTQ output must either be requested explicitly from the Web Interface or be set to run every time (cf. p. 38). Right now it seems to be set up with Autorun. Once it is done, the files appear in

 ./$RUN_NAME/plugin_out/FastqCreator_out/

The Server

It seems to be running Apache2, some Django, some PHP, some JSP (Apache Catalina), PostgreSQL, and even some LaTeX to generate reports (like this one).

The /results/analysis directory seems to be completely served by Apache; there are even some PHP files in the middle of the analyses.

find /results/analysis/output/Home/ -name "*.php" | wc -l
44

For example, when logged in as a user, I can see the file /results/analysis/output/Home/Auto_user_2013-01-14_318_Suphir_3_003/status.txt at http://pgm1.bio.nyu.edu/output/Home/Auto_user_2013-01-14_318_Suphir_3_003/status.txt
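The path-to-URL correspondence in that example can be sketched as below. It assumes (from this single example only) that /results/analysis is the served root and that everything under it maps one-to-one onto the web server; the constant and function names are hypothetical.

```ocaml
(* Assumptions taken from the single example above. *)
let served_root = "/results/analysis"
let base_url = "http://pgm1.bio.nyu.edu"

(* Map a file path under the served root to its URL; None if outside. *)
let url_of_path path =
  let rl = String.length served_root in
  let pl = String.length path in
  if pl > rl && String.sub path 0 rl = served_root && path.[rl] = '/'
  then Some (base_url ^ String.sub path rl (pl - rl))
  else None
```

Paths outside the served root (such as the raw data in /results/sn25080361/) give None.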

To explore the PostgreSQL database:

psql iondb -U ion

Some of the tables seem to be the default Django ones, but there are also custom ones.

smondet commented 10 years ago

gencore@bowery-0-3:/scratch/gencore/pgm-25080361/rsync_raw $ qsub script_rsync_raw.pbs
2632963.crunch.local

smondet commented 10 years ago

Checksums computed with

find . -type f -exec md5sum {} \; >> md5s_2013-10-14

and compared with this script (compare_md5s.ml):

#use "topfind";;
#thread;;
#require "core";;
open Core.Std

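let () =
  let file1 = "md5s_2013-10-14" in
  let file2 = "md5s_2013-10-14-torrent-server" in
  (* Parse an md5sum listing into a map from file path to checksum.
     Note: "%s %s" stops at whitespace, so a path containing spaces
     would be truncated. *)
  let map_of file =
    let open In_channel in
    with_file file ~f:(fun ic ->
        fold_lines ~init:String.Map.empty ic ~f:(fun map line ->
            Scanf.sscanf line "%s %s" (fun data key ->
                Map.add map ~key ~data)
          ))
  in
  (* Print a message on stderr, prefixed with "* ". *)
  let say fmt = ksprintf (eprintf "* %s\n%!") fmt in
  (* For each file of map1, report it if it is missing from map2
     or if its checksum differs. *)
  let go map1 map2 =
    Map.iter map1 (fun ~key ~data ->
        match Map.find map2 key with
        | None -> say "file %s not found" key
        | Some s when s = data -> ()
        | Some s -> say "file %s map1: %S map2: %S" key data s)
  in
  let map1 = map_of file1 in
  let map2 = map_of file2 in
  (* Compare in both directions to catch files missing on either side. *)
  say "iter map1 trying map2:";
  go map1 map2;
  say "iter map2 trying map1:";
  go map2 map1;
  say "Done."

$ ocaml compare_md5s.ml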
* iter map1 trying map2:
* file ./R_2013_10_09_04_20_19_user_SN2-19/.acq_0598.dat.fC9sF1 not found
* file ./md5s_2013-10-14 not found
* file ./rsync_raw/rsync_raw.stderr not found
* file ./rsync_raw/rsync_raw.stdout not found
* file ./rsync_raw/script_rsync_raw.pbs not found
* iter map2 trying map1:
* Done.
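The "%s %s" pattern in the script stops at whitespace, so it would truncate paths containing spaces. A minimal alternative sketch (standard library only, hypothetical function names) relies on the fixed layout of md5sum output: 32 hex digits, a space, a space or '*', then the path. It does not handle the backslash-escaped form that md5sum uses for unusual file names.

```ocaml
(* True when [s] is non-empty and made only of lowercase hex digits. *)
let is_hex s =
  let ok = ref (s <> "") in
  String.iter
    (fun c ->
       match c with
       | '0' .. '9' | 'a' .. 'f' -> ()
       | _ -> ok := false)
    s;
  !ok

(* Parse one md5sum line into (path, checksum); None if malformed.
   The path is everything after the two separator characters,
   spaces included. *)
let parse_md5_line line =
  let n = String.length line in
  if n < 35 then None
  else
    let hash = String.sub line 0 32 in
    let sep_ok =
      line.[32] = ' ' && (line.[33] = ' ' || line.[33] = '*') in
    if sep_ok && is_hex hash
    then Some (String.sub line 34 (n - 34), hash)
    else None
```

The result is ordered as (path, checksum) to match the key/data order used when filling the maps in the script.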