ababaian / LIONS

LIONS is a bioinformatic analysis pipeline which brings together a few pieces of software and some home-brewed scripts to annotate a paired-end RNAseq library to detect TE-initiated transcripts
GNU General Public License v3.0

LIONS container update #6

Closed ababaian closed 5 years ago

ababaian commented 5 years ago

Prepare LIONS Input Example [ Richard ]

Comments

biscuit13161 commented 5 years ago
  1. A second Dockerfile has been provided, which allows Docker users to build a reference-ready container containing all the hg19 reference material and publicly available sample material. In addition, a script (source_refs.sh) has been provided which downloads the hg19 and hg38 reference materials into the correct directory structure for LIONS.
  2. The use of the Docker container allows users to install all the required dependencies, and run LIONS, in a system-agnostic manner. The installation is relatively simple and does not require significant computational experience on the part of the user.
  3. Samtools is a prerequisite of the Bedtools package used by LIONS, and the version of Bedtools that LIONS currently uses is incompatible with the output file format used by Samtools 1.X. However, the Docker build is configured to correctly build Samtools 0.1.18 from source. We are currently testing the newest version of Bedtools with regard to this point.
  4. With the introduction of the reference-ready Dockerfile, publicly available material has been provided and the input.list file has been updated to reflect these files.
  5. LIONS currently carries out an input check immediately prior to running the analysis. However, a script (scripts/preflight_check.sh) has been added should users wish to carry out these checks before running LIONS. Please note that many files are generated during the analysis and soft links are used to place them in the required directories; some file systems are unable to create soft links for these files, resulting in run-time errors. [Artem, we could do with adding a test for this and changing the execution appropriately ... I can, but wasn't sure how you would want to proceed with it]
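
A minimal sketch of what such a test might look like, written in Python rather than as part of the pipeline's shell scripts. The function name `probe_link_support` is hypothetical and not part of LIONS; it simply tries to create one soft link and one hard link inside the directory being checked and reports which of the two succeeded.

```python
import os
import tempfile


def probe_link_support(directory):
    """Report whether `directory` accepts soft (symbolic) and hard links."""
    support = {"symlink": False, "hardlink": False}
    # Work inside a throwaway sub-directory so nothing is left behind.
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target.txt")
        with open(target, "w") as handle:
            handle.write("probe\n")
        for kind, make_link in (("symlink", os.symlink), ("hardlink", os.link)):
            link = os.path.join(scratch, kind + ".lnk")
            try:
                make_link(target, link)
                # Reading through the link confirms that it actually resolves.
                with open(link) as handle:
                    support[kind] = handle.read() == "probe\n"
            except (OSError, NotImplementedError):
                support[kind] = False
    return support
```

Calling `probe_link_support(".")` from the intended output directory returns a dictionary such as `{"symlink": ..., "hardlink": ...}`, which the run could use to decide how to place files.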
ababaian commented 5 years ago

Hey Richard, sorry about the email troubles. It's a bit embarrassing that our IT can't keep it operational.

The original files I've linked for the ENCODE data are quite large. Since the purpose is just to get a test run online, we can download a smaller subset of that data (same accession).

GM12878 (ENCSR000EYN)
rep1_read1: https://www.encodeproject.org/files/ENCFF000CXX/@@download/ENCFF000CXX.fastq.gz
rep1_read2: https://www.encodeproject.org/files/ENCFF000CYN/@@download/ENCFF000CYN.fastq.gz

rep2_read1: https://www.encodeproject.org/files/ENCFF000CYH/@@download/ENCFF000CYH.fastq.gz
rep2_read2: https://www.encodeproject.org/files/ENCFF000CYX/@@download/ENCFF000CYX.fastq.gz

H1esc (ENCSR000EYP)
rep1_read1: https://www.encodeproject.org/files/ENCFF000DGR/@@download/ENCFF000DGR.fastq.gz
rep1_read2: https://www.encodeproject.org/files/ENCFF000DGZ/@@download/ENCFF000DGZ.fastq.gz

rep2_read1: https://www.encodeproject.org/files/ENCFF000DGT/@@download/ENCFF000DGT.fastq.gz
rep2_read2: https://www.encodeproject.org/files/ENCFF000DHB/@@download/ENCFF000DHB.fastq.gz

K562 (ENCSR000EYO)
rep1_read1: https://www.encodeproject.org/files/ENCFF000DWV/@@download/ENCFF000DWV.fastq.gz
rep1_read2: https://www.encodeproject.org/files/ENCFF000DXN/@@download/ENCFF000DXN.fastq.gz

rep2_read1: https://www.encodeproject.org/files/ENCFF000DXE/@@download/ENCFF000DXE.fastq.gz
rep2_read2: https://www.encodeproject.org/files/ENCFF000DXW/@@download/ENCFF000DXW.fastq.gz
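
The list above follows a single URL pattern, so a download step could be sketched roughly as below. This is an illustration only: the file accessions are copied from the list, but the function names (`encode_url`, `download_plan`, `fetch_all`) and the local file-naming scheme are assumptions, not something LIONS or input.list prescribes.

```python
import os
import urllib.request

# File accessions copied from the list above: sample -> replicate -> (read1, read2).
TEST_FILES = {
    "GM12878": {"rep1": ("ENCFF000CXX", "ENCFF000CYN"),
                "rep2": ("ENCFF000CYH", "ENCFF000CYX")},
    "H1esc": {"rep1": ("ENCFF000DGR", "ENCFF000DGZ"),
              "rep2": ("ENCFF000DGT", "ENCFF000DHB")},
    "K562": {"rep1": ("ENCFF000DWV", "ENCFF000DXN"),
             "rep2": ("ENCFF000DXE", "ENCFF000DXW")},
}


def encode_url(file_id):
    """Build the download URL for one ENCODE FASTQ file accession."""
    return ("https://www.encodeproject.org/files/"
            f"{file_id}/@@download/{file_id}.fastq.gz")


def download_plan(files=TEST_FILES):
    """Return (url, local_name) pairs; the local names are an illustrative choice."""
    plan = []
    for sample, replicates in files.items():
        for rep, mates in replicates.items():
            for mate, file_id in enumerate(mates, start=1):
                local_name = f"{sample}_{rep}_read{mate}.fastq.gz"
                plan.append((encode_url(file_id), local_name))
    return plan


def fetch_all(out_dir):
    """Download every file in the plan, skipping ones already present."""
    os.makedirs(out_dir, exist_ok=True)
    for url, local_name in download_plan():
        path = os.path.join(out_dir, local_name)
        if not os.path.exists(path):
            urllib.request.urlretrieve(url, path)
```

`download_plan()` can be inspected without touching the network; `fetch_all("fastq")` would then fetch the twelve files into a directory (the name `fastq` is just an example).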

Hard vs. Soft Links

I forgot about that limitation of file systems and hard links. The issue has been that the most recent versions of tophat/cufflinks throw errors when dealing with soft links, so I've transitioned to hard links. This creates its own problems, as you mention, and I'm not certain how best to resolve this.

I think using mv operations would be one solution, with obvious drawbacks.

What I'm thinking is that, once everything is ready and we can test it "fastq-to-finish", we could:

  1. Use soft links where possible and find a work-around for tophat2 if it throws errors
  2. Use a mix of soft/hard links to fix tophat2 errors, detect whether the file system supports this, and if not use mv to overcome any errors which arise.

Does that sound reasonable?
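
For reference, the fallback in option 2 could be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not LIONS code: the function name `place_file` and its `order` parameter are hypothetical, and the default order (soft link, then hard link, then move) is only one possible choice. Files that tophat2 has to read could be placed with `order=("hardlink", "move")` instead.

```python
import os
import shutil


def place_file(source, destination, order=("symlink", "hardlink", "move")):
    """Make `source` available at `destination`; return the method that worked."""
    source = os.path.abspath(source)
    # Refuse to touch an existing destination, so the move fallback
    # can never silently overwrite a file.
    if os.path.lexists(destination):
        raise FileExistsError(destination)
    makers = {
        "symlink": os.symlink,   # soft link
        "hardlink": os.link,     # hard link (same file system only)
        "move": shutil.move,     # last resort: the source path disappears
    }
    last_error = None
    for method in order:
        try:
            makers[method](source, destination)
            return method
        except (OSError, NotImplementedError) as error:
            last_error = error   # remember why this method failed, try the next
    raise OSError(f"could not place {source} at {destination}") from last_error
```

Because each method is attempted in turn, no separate file-system probe is needed up front; the return value records which method was actually used, which would be worth logging since a move removes the file from its original location.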