RUV normalization support and transcript analysis

arcolombo commented 8 years ago

This PR adds RUV support and fixes a few bugs in transcript-level analysis. Travis has been turned off, and it may be better not to merge this until the Travis build time supports unit testing as opposed to workflow testing.

arcolombo commented 8 years ago

This merge uses bash scripts to support SRAdb and fastq-dump. Although fastq-dump is not an ideal way to convert SRA to FASTQ, these shell scripts adjust the SRA headers and map them to Illumina standard headers. This was not tested on all SRA headers, so I won't merge this until I can ascertain whether the SRAdb headers are reasonably uniform. This PR assumes that all SRAdb headers are in the same format.

The SRA headers I encountered were in this format:

@SRR3173882.sra.1 HWI-ST1209-LAB:323:HA9TPADXX:1:1101:1408:2086 length=50

So I will have to look at several other SRAdb files and see whether their headers match this format.
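A minimal sketch of the header mapping described above, written in Python rather than the awk used by the PR's scripts (which are not shown in this thread). It handles only the single header layout quoted above. The function name `sra_to_illumina_header` and the trailing `read:filtered:control:index` values are assumptions made for illustration: the SRA header does not carry that information, so placeholder defaults are used.

```python
import re

# Matches only the one SRA header layout quoted in this thread:
#   @<run>.sra.<spot> <instrument>:<run no.>:<flowcell>:<lane>:<tile>:<x>:<y> length=<n>
SRA_HEADER = re.compile(
    r"^@(?P<run>[A-Z]RR\d+)\.sra\.(?P<spot>\d+) "
    r"(?P<illumina>[^:\s]+(?::[^:\s]+){6}) "
    r"length=(?P<length>\d+)$"
)


def sra_to_illumina_header(line, read="1", filtered="N", control="0", index="1"):
    """Return an Illumina-style header, or None if the line does not match.

    The read/filtered/control/index defaults are placeholders (an assumption),
    because the SRA header contains no such fields.
    """
    match = SRA_HEADER.match(line.rstrip("\n"))
    if match is None:
        # Unknown layout: report it instead of guessing at a conversion.
        return None
    return f"@{match['illumina']} {read}:{filtered}:{control}:{index}"


if __name__ == "__main__":
    example = (
        "@SRR3173882.sra.1 "
        "HWI-ST1209-LAB:323:HA9TPADXX:1:1101:1408:2086 length=50"
    )
    print(sra_to_illumina_header(example))
```

Returning `None` for headers that do not match makes it easy to count how many records in another SRAdb file deviate from this layout, which is the uniformity check proposed above.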

arcolombo commented 8 years ago

This passed the checks. FYI: this can be merged, although the SRA import header conversion files will likely be deprecated. However, I think the FASTQ BaseSpace upload functions are useful. This also has RUV support. Merging.

arcolombo commented 8 years ago

As a note, the SRA header conversion script was written after looking at the Xin et al. pancreatic SRR submission and importing it from SRAdb. I did not have time to look at every format in SRAdb. The weaker commits in this merge are weak because they assume that SRAdb headers have the following format: @SRR3173882.sra.1 HWI-ST1209-LAB:323:HA9TPADXX:1:1101:1408:2086 length=50

The good news is that I learned the basics of awk, so the bash scripts work efficiently for a single FASTQ and could be run in parallel.

Perhaps this should be extended to databases not supported by BaseSpace (ENA, TCGA), with a general Rawk script that converts FASTQs from ENA/TCGA to the Illumina standard. As long as a database has uniformity, converting is "easy".

~HAL

ttriche commented 8 years ago

Don't waste time on converting every format in SRAdb; that's Illumina's problem IMHO. Automating the load of SRA studies from BaseSpace via the SRA import app, on the other hand, could be very useful. The goal is to reduce movement of data: it costs time, bandwidth, and money.

--t


arcolombo commented 8 years ago

Yes, this is a good idea. You can execute apps via the bs CLI, so running SRA import from R should be put on the menu.

Sounds good. AC


RamsinghLab / arkas

RUV normalization support and transcript analysis #3