genome / gms

The Genome Modeling System installer
https://github.com/genome/gms/wiki
GNU Lesser General Public License v3.0

Create a simple sample importer for users to get their data into the standalone GMS #57

Closed malachig closed 10 years ago

malachig commented 10 years ago

Some work has started here: Genome/InstrumentData/Command/Export/Samplesheet.pm

Need to decide how to handle the actual BAM files, not just the metadata associated with instrument data, samples, libraries, individuals, etc.

We will still need to create: Genome/InstrumentData/Command/Import/Samplesheet.pm

ghost commented 10 years ago

I see that Scott is currently assigned this issue. I also plan to work on a solution to the data import problem. As first steps, I'm looking at the ways we currently get data into the system, such as the .dat file, the LIMS/Apipe bridge at TGI, and the aforementioned exporter, Genome::InstrumentData::Command::Export::Samplesheet.

ghost commented 10 years ago

Status

By the end of last week, I had created a vagrant box that has sGMS installed but no data imported (i.e., make finished successfully, but no further setup steps had been run). So now I'm actively working on this task to create a simple way for reviewers to import arbitrary samples.

I've tried a couple of times at this point to run the current importer, genome model import metadata, and both times it failed. The import gets hung up on git cloning the Cosmic database repo, which is the first thing that the current importer tries to do.

Increasing the amount of RAM available to my VM from 1GB to 2GB has fixed the problem with importing Cosmic. Git defaults the configuration variable core.packedGitWindowSize to 1GB on a 64-bit machine. So I used this as my guide. I also believe that reducing core.packedGitWindowSize to a smaller value (perhaps 512MB) would have also fixed the problem with cloning Cosmic.
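As an untested sketch of that second option: git accepts the setting either in a repository's config or as a one-off -c override on the command line. The 512m value is only the guess mentioned above, the scratch repository exists purely to show the setting being stored and read back, and REPO_URL in the final comment is a placeholder rather than the real Cosmic repository location.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical value from the discussion above; not tested against the Cosmic repo.
WINDOW_SIZE="512m"

# Demonstrate the setting in a throwaway repository so nothing global changes.
scratch="$(mktemp -d)"
git init --quiet "$scratch"
git -C "$scratch" config core.packedGitWindowSize "$WINDOW_SIZE"
configured="$(git -C "$scratch" config --get core.packedGitWindowSize)"
echo "core.packedGitWindowSize=$configured"
rm -rf "$scratch"

# For a one-off clone, the same override can be passed with -c.
# REPO_URL is a placeholder, not the real Cosmic repository location:
#   git -c core.packedGitWindowSize="$WINDOW_SIZE" clone "$REPO_URL"
```

Whether this actually avoids the failure on a 1GB VM would still need to be confirmed by trying the clone.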

Plan

My first-pass approach is to create a bash script of genome commands that replaces importing the .dat file. I think this is a good approach because it achieves the same end goal as the current importer while teaching the end user how to use simple existing commands to get their data into the system. However, it is possible that the end result would expose more complexity to the end user than is desirable.

This week I would like to have a proof-of-concept bash script that "imports" many of the data objects present in the .dat file, but in a format that is less dense than the .dat file, with the hope that the end result is a script that could be modified by end users to import their own samples. This exercise will expose areas of the genome command line interface that are not sufficient for end users to get their data into sGMS.

Once the first pass of this script has been written, I think we'll have enough information to decide whether we should continue down this path, or try another solution.
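To give a rough idea of the shape such a script might take, here is a minimal sketch. The run helper and the DRY_RUN and INDIVIDUAL_COMMON_NAME variables are hypothetical names introduced for illustration, and the two genome commands are simply ones that appear elsewhere in this thread, not the actual import steps.

```shell
#!/usr/bin/env bash
set -euo pipefail

# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to execute.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo a command, and run it unless in dry-run mode.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" != "1" ]; then
        "$@"
    fi
}

# Illustrative value only; end users would substitute their own.
INDIVIDUAL_COMMON_NAME="TST1"

# One step per line, so the script is easy to read and to edit.
run genome sample list --filter "individual_common_name=$INDIVIDUAL_COMMON_NAME"
run genome model list --show id
```

Keeping every step behind one helper means a user can preview the whole sequence before anything touches the database.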

malachig commented 10 years ago

Yes, let's go ahead and try the 'bash script' approach and see how it looks in an early version.

sakoht commented 10 years ago

Awesome!

Sent by email in reply to mkiwala-g's comment of Jan 28, 2014, 1:49 PM.

ghost commented 10 years ago

Once an end user has imported their data, what pipelines should they expect to run? I believe the answer to this question will help direct my focus towards exactly the kinds of information that need to be imported.

Right now I'm considering the creation of:

It is possible that there are other important things that users need to be able to import. Are there other entities which the user needs to be able to create in the database in order to run the pipelines that you would like for them to run?

gatoravi commented 10 years ago

The goal, so far as I understand, is to get a successful build of a ClinSeq model for HCC1395. This means all of the underlying models of ClinSeq need to have successful builds.

These are:

  1. genotype array models for normal and tumor
  2. exome-ref-align models for normal, tumor
  3. WGS-ref-align models for normal, tumor
  4. exome-somatic variation model
  5. WGS-somatic variation model
  6. rna-seq models for normal, tumor
  7. differential expression model
  8. Clinseq.

@sakoht and @malachig would be able to comment more on the entities that need to be created in the databases. We do import some software results from within TGI (for shortcutting the ref-align-index steps, for example), but I'm not sure how they are stored in the db.

sakoht commented 10 years ago

Yep!

Sent by email in reply to Avinash Ramu's comment of Wed, Feb 12, 2014, 10:14 AM (https://github.com/genome/gms/issues/57#issuecomment-34897881).

ghost commented 10 years ago

These are:

  1. genotype array models for normal and tumor
  2. exome-ref-align models for normal, tumor
  3. WGS-ref-align models for normal, tumor
  4. exome-somatic variation model
  5. WGS-somatic variation model
  6. rna-seq models for normal, tumor
  7. differential expression model
  8. Clinseq.

Do any of these models strictly require microarray data? I understand that some models may optionally accept microarray data to enable QC. Right now we do not have a mechanism to import microarray data.

malachig commented 10 years ago

No. They are an optional but desired input for reference alignment. If used there, they will also be used in clinseq for cnv analysis.

Malachi

Sent by email in reply to mkiwala-g's comment of Feb 18, 2014, 8:41 AM (https://github.com/genome/gms/issues/57#issuecomment-35389869).

ghost commented 10 years ago

I've developed a script that more or less replaces many functions of the metadata import. The script combined with forthcoming wiki documentation is intended to elucidate for a reviewer how to bring additional samples and data into the system. It does this by translating the cryptic code of the .dat file into less cryptic genome commands that end users would run in the normal operation of GMS.

The script does:

Because the script is based on the same dataset that is imported during genome model import metadata, I've removed some code from the metadata .dat file. My metadata .dat file is 220 lines compared to 766 in the original file. The lines removed include disk allocations, instrument data and their attributes, libraries, models and their inputs, the individual, the samples, and the attributes of the individual and samples.

The script as written does not do some things:

At this point, I'm ready to test running builds against the data and models imported and defined by this script. Avi is installing gms on the blade so that I can do this. I'm also ready to start documenting on the wiki. The script and the comments will be the basis for the documentation effort.

Differential Expression Model Trouble

Here is how I attempt to define the de model:

genome model define differential-expression                                     \
    --model-name="$MODEL_DIFFERENTIAL_EXPRESSION"                               \
    --subject="$INDIVIDUAL"                                                     \
    --processing-profile="$PROCESSING_PROFILE_DIFF_EXP"                         \
    --annotation-build="model_name=$GENOME_BUILD_ANNOTATION"                    \
    --reference-sequence-build="model_name=$GENOME_BUILD_REFERENCE"             \
    --condition-labels-string='normal,tumor'                                    \
    --condition-model-ids-string="$MODEL_NORMAL_RNASEQ_ID,$MODEL_TUMOR_RNASEQ_ID"

Look at the last parameter. --condition-model-ids-string is set to a pair of ids, e.g. a96e3b696e3b4c3186a6a16914e49599,72a79b9b8659471b80b5ddfdb1461b05. This gives the error:

ERROR: The input model a96e3b696e3b4c3186a6a16914e49599 does not have a succeeded build!

Since no builds have been run at this point, the de model cannot be defined. We can either leave it like this, or modify genome model define differential-expression to accept model ids for which there are no builds.

Also, genome model define differential-expression is the only command in this script which must know the actual ID of something. Everywhere else, I was able to use friendlier names -- such as the model_name -- to refer to an entity that has already been created. This is a (minor) problem because after creating the rnaseq models, I have to execute genome model list --show id commands to find out the IDs of the models so that the de model can be created. So if we fix the above problem with requiring the build, it may be worthwhile to allow a model_name instead of insisting on the ID.

How would you like to proceed with regard to the de model?
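For reference, the ID lookup step could be wrapped roughly as follows. This is only a sketch: the model_id_for_name helper and the MODEL_NORMAL_RNASEQ variable are hypothetical, the use of "name" as a filter property for genome model list is an assumption (only --filter and --show themselves appear in this thread), and the two-line header it skips is assumed to match the output of the other list commands. It does nothing about the separate requirement for a succeeded build.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the ID of the first model matching the given name.
# ASSUMPTION: "name" is a valid filter property for `genome model list`.
# ASSUMPTION: output starts with a column-title line and a line of dashes.
model_id_for_name() {
    local name="$1"
    genome model list --filter "name=$name" --show id \
        | awk 'NR > 2 && NF > 0 && !seen { print $1; seen = 1 }'
}

# Example use, guarded so the sketch is harmless where genome is not installed.
# MODEL_NORMAL_RNASEQ is a hypothetical variable holding the model's name.
if command -v genome >/dev/null 2>&1 && [ -n "${MODEL_NORMAL_RNASEQ:-}" ]; then
    MODEL_NORMAL_RNASEQ_ID="$(model_id_for_name "$MODEL_NORMAL_RNASEQ")"
    echo "normal rnaseq model id: $MODEL_NORMAL_RNASEQ_ID"
fi
```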

ghost commented 10 years ago

I discovered on Friday that the version of the instrument data importer included in gms-pub since the merge is much different from the version used before the merge. I have some changes to the instrument data importer to allow specifying the library as an argument during import.

In testing this morning, the new importer code had trouble running seq-grind because seq-grind is not installed (or is not installed where the importer expects it to be).

ghost commented 10 years ago

It looks like seq-grind is something we package:

mkiwala@linus82:genome[gms-pub] $ apt-cache policy seqgrind0.1.0
seqgrind0.1.0:
  Installed: 0.1.0
  Candidate: 0.1.0
  Version table:
 *** 0.1.0 0
       1001 http://repo.gsc.wustl.edu/ubuntu/ lucid-genome-development/main Packages
        100 /var/lib/dpkg/status

malachig commented 10 years ago

The good news is that this worked:

#add to apt sources:
#deb http://repo.gsc.wustl.edu/ubuntu lucid-genome-development main non-free
sudo apt-get update
sudo apt-cache policy seqgrind*
sudo apt-get install seqgrind0.1.0
sudo apt-cache policy seqgrind0.1.0

@gatoravi has added this to genome-snapshot-deps. Refer to issue #133 for updates regarding this.

ghost commented 10 years ago

Installing the lucid seqgrind package solved the problem with running the importer.

ghost commented 10 years ago

After importing data, I tried to create models and start builds using the genome model clin-seq update-analysis command. It is not working for me; I get errors such as:

Found a DNA sample H_NJ-HCC1395-HCC1395_BLds (normal) matching tissue type: normal
Could not find any wgs data
ERROR: Did not find a matching DNA sample for tissue type: tumor|met|post treatment|recurrence met|pre-treatment met|pin lesion|relapse|xenograft|pre-resistant|post-resistant

Defining the model by hand works, but starting the build fails when trying to create an lsf job:

ssmith@blade16-4-16 ~/mkiwala> genome model build start 3a860cc461d74ca48abeb9387188258c
'models' may require verification...
Resolving parameter 'models' from command argument '3a860cc461d74ca48abeb9387188258c'... found 1
Trying to start #1: hcc1395-normal-rnaseq-ds (3a860cc461d74ca48abeb9387188258c)...
ERROR: Failed to launch bsub: The 'queue' parameter ("workflow") to Genome::Sys::LSF::bsub::_args did not pass the 'valid LSF queue' callback
 at /opt/gms/AXXXB55/sw/genome/lib/perl/Genome/Sys/LSF/bsub.pm line 60
    Genome::Sys::LSF::bsub::_args('email', 'gmsuser@example.com', 'err_file', '/opt/gms/AXXXB55/fs/AXXXB55/info/model_data/3a860cc461d74ca48...', 'hold_job', 1, 'log_file', '/opt/gms/AXXXB55/fs/AXXXB55/info/model_data/3a860cc461d74ca48...', 'project', ...) called at /opt/gms/AXXXB55/sw/genome/lib/perl/Genome/Sys/LSF/bsub.pm line 38

ghost commented 10 years ago

Queueing the build fails because the build wants to put the job in the "workflow" queue, but no queue named "workflow" is defined on blade16-4-16.

ghost commented 10 years ago

Pulling the latest gms and rebuilding openlava fixed this problem.

ghost commented 10 years ago

I have not yet solved the problems I've had with running clinseq update-analysis on the imported data set.

I've restarted builds today on the imported data because the dataset I was using ran into the strelka problem, so it was likely the old dataset. The currently running builds are running from instrument data freshly copied in.

ghost commented 10 years ago

clinseq update-analysis filters instrument data down to the Solexa subclass only. The data imported by this tool is in the "Imported" subclass. Is it OK to remove the filter for Solexa-only reads?

After importing data, I tried to create models and start builds using the genome model clin-seq update-analysis command. It is not working for me, I get errors such as:

Found a DNA sample H_NJ-HCC1395-HCC1395_BLds (normal) matching tissue type: normal
Could not find any wgs data
ERROR: Did not find a matching DNA sample for tissue type: tumor|met|post treatment|recurrence met|pre-treatment met|pin lesion|relapse|xenograft|pre-resistant|post-resistant

malachig commented 10 years ago

What does the output of genome sample list --filter individual_common_name=TST1 look like in your test system?

ghost commented 10 years ago

I ran the command you requested. It displays sample information which was imported from the metadata file:

ubuntu@sgms ~> genome sample list --filter individual_common_name=TST1
ID           NAME                          SPECIES_NAME   PATIENT_COMMON_NAME   COMMON_NAME   TISSUE_LABEL   TISSUE_DESC     EXTRACTION_TYPE   EXTRACTION_LABEL   EXTRACTION_DESC
--           ----                          ------------   -------------------   -----------   ------------   -----------     ---------------   ----------------   ---------------
2889953341   H_NJ-HCC1395-HCC1395_BL_RNA   human          TST1                  normal        <NULL>         b lymphoblast   rna               HCC1395 BL_RNA     <NULL>
2889953342   H_NJ-HCC1395-HCC1395_RNA      human          TST1                  tumor         <NULL>         epithelial      rna               HCC1395_RNA        <NULL>
2889981253   H_NJ-HCC1395-HCC1395          human          TST1                  tumor         <NULL>         epithelial      genomic dna       HCC1395            <NULL>
2889981254   H_NJ-HCC1395-HCC1395_BL       human          TST1                  normal        <NULL>         b lymphoblast   genomic dna       HCC1395 BL         <NULL>

The import script imports samples under the "TST1ds" name, so I also ran that command:

ubuntu@sgms ~> genome sample list --filter individual_common_name=TST1ds
ID                                 NAME                            SPECIES_NAME   PATIENT_COMMON_NAME   COMMON_NAME   TISSUE_LABEL   TISSUE_DESC     EXTRACTION_TYPE   EXTRACTION_LABEL   EXTRACTION_DESC
--                                 ----                            ------------   -------------------   -----------   ------------   -----------     ---------------   ----------------   ---------------
24a5cea56f2b4b84aa6b6a2bbe28b155   H_NJ-HCC1395ds-HCC1395          human          TST1ds                tumor         <NULL>         epithelial      genomic dna       HCC1395            <NULL>
83fa30ecfc294d7fa5259154941f0ddc   H_NJ-HCC1395ds-HCC1395_BL       human          TST1ds                normal        <NULL>         b lymphoblast   genomic dna       HCC1395 BL         <NULL>
9672ea6e1a404571a004b307a79316d0   H_NJ-HCC1395ds-HCC1395_BL_RNA   human          TST1ds                rna normal    <NULL>         b lymphoblast   rna               HCC1395 BL_RNA     <NULL>
9e97dafdb8094f64b0e1896c2f477b95   H_NJ-HCC1395ds-HCC1395_RNA      human          TST1ds                rna tumor     <NULL>         epithelial      rna               HCC1395_RNA        <NULL>

There is a difference in the common name for the rna samples between the TST1 and TST1ds samples (eg, "tumor" vs "rna tumor"). It looks like Obi wrote that code as part of a large commit. Should we change the common name of the rna samples to match how they are for TST1?

malachig commented 10 years ago

Yes, that looks like a typo. The sample common name should be 'normal', 'tumor', etc. The 'rna' vs. 'genomic dna' distinction belongs in the extraction type. That part looks correct.
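A quick way to spot this kind of mismatch in a listing might look like the sketch below. The flag_suspect_common_names helper is hypothetical, and it assumes the column layout of the listings above (fields separated by two or more spaces, with NAME second, COMMON_NAME fifth and EXTRACTION_TYPE eighth). It flags rows whose common name begins with the extraction type.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Read `genome sample list` output on stdin and print NAME and COMMON_NAME
# (tab-separated) for rows whose COMMON_NAME begins with the EXTRACTION_TYPE.
# ASSUMPTION: columns are separated by two or more spaces, in the order shown
# in the listings above (NAME 2nd, COMMON_NAME 5th, EXTRACTION_TYPE 8th).
flag_suspect_common_names() {
    awk -F'  +' 'NR > 2 && NF >= 8 && index($5, $8 " ") == 1 { print $2 "\t" $5 }'
}

# Header plus two rows copied from the TST1ds listing above.
sample_listing="$(cat <<'EOF'
ID                                 NAME                            SPECIES_NAME   PATIENT_COMMON_NAME   COMMON_NAME   TISSUE_LABEL   TISSUE_DESC     EXTRACTION_TYPE   EXTRACTION_LABEL   EXTRACTION_DESC
--                                 ----                            ------------   -------------------   -----------   ------------   -----------     ---------------   ----------------   ---------------
24a5cea56f2b4b84aa6b6a2bbe28b155   H_NJ-HCC1395ds-HCC1395          human          TST1ds                tumor         <NULL>         epithelial      genomic dna       HCC1395            <NULL>
9672ea6e1a404571a004b307a79316d0   H_NJ-HCC1395ds-HCC1395_BL_RNA   human          TST1ds                rna normal    <NULL>         b lymphoblast   rna               HCC1395 BL_RNA     <NULL>
EOF
)"

flag_suspect_common_names <<<"$sample_listing"
```

On a live system the same function could be fed from genome sample list --filter individual_common_name=TST1ds instead of the pasted text.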

malachig commented 10 years ago

@mkiwala-g did you ever look into the issue above: "clinseq update-analysis filters out instrument data in the Solexa subclass. The data imported by this tool is in the "Imported" subclass. Is ok to remove the filter for Solexa only reads?"

I just went through the data import tutorial and encountered the same problem. It seems that importing the data has worked but clin-seq advise does not recognize any of the data...

malachig commented 10 years ago

Oh, I see. I need to use the --allow-imported option.

malachig commented 10 years ago

Everything seems to be working smoothly with the imported data and builds using that data. We will still want to make a more user friendly tool to help the naive user import their data but that can be accomplished as a separate issue when we have more clearly defined the requirements.

Really nice job on this @mkiwala-g