Taiji-pipeline / Taiji

All-in-one analysis pipeline
https://taiji-pipeline.github.io/
BSD 3-Clause "New" or "Revised" License
34 stars 9 forks source link

basic questions #23

Closed dimitras closed 2 years ago

dimitras commented 2 years ago

Hi Kai, I am trying to run Taiji on my bulk ATAC- and RNA-Seq data. The assembly used is hg19.

So my config file looks like:

input: "input.tsv"
output_dir: "output/"
assembly: "hg19"
genome: "annotation/Homo_sapiens.GRCh37.dna.primary_assembly.fa"
annotation: "annotation/Homo_sapiens.GRCh37.87.gtf"
motif_file: "annotation/cisBP_human.meme"

submit_params: "--time=12:00:00"
submit_command: "sbatch"
submit_cpu_format: "--cpus-per-task=24"
submit_memory_format: "--mem=64g"

One thing I'm confused about is how to get the correct meme file for the motifs from hg19. I have tried downloading from meme suite but I suppose the memes provided there are based on the latest genome?

Another question I have is about the input.tsv. For the ATAC, is it enough if I give for each sample the narrow peaks bed file like:

type    id  group   rep path    tags    format
ATAC-seq    h40 cond1   1   h40.narrow_Peaks.bed    Bed NarrowPeak

Or should I give the bam file as well? Like:

type    id  group   rep path    tags    format
ATAC-seq    h40 cond1   1   h40.narrow_Peaks.bed    Bed NarrowPeak
ATAC-seq    h40 cond1   1   h40.bam Bam Aligned Reads

For the RNA, my input per sample looks like that: RNA-seq H80 cond1 60 H80.abundance.txt GeneQuant Expression profile

And then running taiji run --config config.yml --cloud I get the following error:

[INFO][04-14 15:23] RNA_Download_Data(1f85..): Complete!
[INFO][04-14 15:23] RNA_Download_Data(76eb..): Complete!
[INFO][04-14 15:23] RNA_Download_Data(826c..): Complete!
[ERROR][04-14 15:24] RNA_Make_Index(1ca6..) Failed: printf: formatting string ended prematurely
[ERROR][04-14 15:24] Program exits with errors

It has created the sciflow.db and in the output the only thing is created is:

output
└── RNASeq
    └── Download

I even tried with giving a fai file as well. Created .fai with samtools and added to config: genome_index: "annotation/Homo_sapiens.GRCh37.dna.primary_assembly.fa.fai"

But I got the exact same error.

Any insights would be greatly appreciated! Thank you!

kaizhang commented 2 years ago

Sorry for the late response.

One thing I'm confused about is how to get the correct meme file for the motifs from hg19. I have tried downloading from meme suite but I suppose the memes provided there are based on the latest genome?

Motif file is not assembly-specific. Here is the motif file I used for the human genome: https://taiji-pipeline.github.io/documentation/_downloads/cisBP_human.meme.

Another question I have is about the input.tsv. For the ATAC, is it enough if I give for each sample the narrow peaks bed file like:

If you have bam files, it is better to include that as well, using:

type    id  group   rep path format
ATAC-seq    h40 cond1   1   h40.narrow_Peaks.bed    NarrowPeak
ATAC-seq    h40 cond1   1   h40.bam Bam

For the RNA, my input per sample looks like that: RNA-seq H80 cond1 60 H80.abundance.txt GeneQuant Expression profile

Correct format should be something like this:

type    id  group   rep path tags
RNA-seq H80 cond1 60 H80.abundance.txt GeneQuant

Here is an example file: https://taiji-pipeline.github.io/documentation/_downloads/example_input.tsv

I even tried with giving a fai file as well. Created .fai with samtools and added to config:

Genome index is generated by Taiji. Don't generate it by yourself.

dimitras commented 2 years ago

Thank you for your response Kai.

I have an issue that I am not sure how to resolve. Taiji runs up to the Confusion_Table step and then crashes. I wonder if there is sth in the bed or bad format I should check or maybe it is a resources issue?

[INFO][04-18 12:20] Confusion_Table(f7b8..): Running ...
[INFO][04-18 12:20] Confusion_Table(f7b8..): Complete!
[INFO][04-18 12:20] ATAC_Make_Index(d30b..): Running ...
[ERROR][04-18 12:20] ATAC_Make_Index(d30b..) Failed: user error (call: remote process died: DiedDisconnect)
CallStack (from HasCallStack):
  error, called at src/Control/Workflow/Interpreter/Exec.hs:146:37 in SciFlow-0.8.0-IRKsT2ba9M716PeGlwt2FT:Control.Workflow.Interpreter.Exec
[ERROR][04-18 12:20] Program exits with errors

The output looks like:

output/
├── ATACSeq
│   └── Download
├── GENOME
│   └── HG19.index
├── RNASeq
│   ├── Download
│   ├── expression_profile.tsv
│   ├── Gt001.BM_rep8_gene_quant.tsv
│   ├── Gt001.BN_rep8_gene_quant.tsv
........
── SCATACSeq
    └── Spectral

The sciflow.db file for last two steps looks like:

taiji show Confusion_Table
<< Confusion_Table(f7b838d7b32a98cefcdccfb0946a253e56b1c6b7ef47c502fb141a0e1c15cd5a) >>
()

taiji show ATAC_Make_Index
(which returns nothing)

Is the ATAC_Download_Data step as it should be?

taiji show ATAC_Download_Data
<< ATAC_Download_Data(fe0fc639c515e5d42e63fff5fbb95d4269c9f5d607c771aff63f0a834834b4d7) >>
ATACSeq
  { atacseqCommon =
      CommonFields
        { _commonEid = "Gt111.BN"
        , _commonGroupName = Just "BN"
        , _commonSampleName = ""
        , _commonReplicates =
            fromList
              [ ( 1
                , Replicate
                    { replicateFiles =
                        [ Left
                            ( []
                            , Bed
                            , File
                                { fileLocation =
                                    "MACS2.471.narrow_Peaks.bed"
                                , fileInfo = fromList []
                                }
                            )
                        , Left
                            ( []
                            , Bam
                            , File
                                { fileLocation =
                                    "471.bam"
                                , fileInfo = fromList []
                                }
                            )
                        ]
                    , _replicateInfo = fromList []
                    }
                )
              ]
        }
  }

Any insight would be so appreciated. Let me know if you need more info.

Thanks again, Dimitra

kaizhang commented 2 years ago

The ATAC_Make_Index step is failing. You could run:

Taiji run --config your_config_file --select ATAC_Make_Index

This runs the ATAC_Make_Index step only. Please see if the error message is more clear this time. If you are running the program using a workload manager/distributed computing, please look into the log files as well.

dimitras commented 2 years ago

Thank you for your quick response!

The slurm has logged: /var/spool/slurm/slurmd/job36819951/slurm_script: line 3: 73488 Bus error master_id=tsetmtnhyfmyfxtf taiji remote --ip cn2316 --port 47481

But yes let me try to run ATAC_Make_Index as you suggested.

dimitras commented 2 years ago

Running only the ATAC_Make_Index step, I got the exact same error:

[INFO][04-18 15:10] Found a new worker: pid://cn1042:8000:0:8
[INFO][04-18 15:10] ATAC_Make_Index(d30b..): Running ...
[ERROR][04-18 15:10] ATAC_Make_Index(d30b..) Failed: user error (call: remote process died: DiedDisconnect)
CallStack (from HasCallStack):
  error, called at src/Control/Workflow/Interpreter/Exec.hs:146:37 in SciFlow-0.8.0-IRKsT2ba9M716PeGlwt2FT:Control.Workflow.Interpreter.Exec
[ERROR][04-18 15:10] Program exits with errors

And in the slurm log:

/var/spool/slurm/slurmd/job36841772/slurm_script: line 3: 26442 Bus error master_id=uwqjacxmjumfhbaz taiji remote --ip cn0910 --port 52092

kaizhang commented 2 years ago

ATAC_Make_Index calls bwa to build the index which uses a lot of memory. Please make sure you request enough memory when using slurm. For example, 50G

dimitras commented 2 years ago

I was giving it 80G, but I gave it 120G this time. It was both memory and apparently a permission issue. The index file that was generated in an earlier step didn't have exec permission.

Generating genome index ...

[INFO][04-18 15:24] ATAC_Make_Index(d30b..): Running ...
[ERROR][04-18 15:24] ATAC_Make_Index(d30b..) Failed: output//GENOME/HG19.index: openFile: resource busy (file is locked)
CallStack (from HasCallStack):
  error, called at src/Control/Workflow/Interpreter/Exec.hs:146:37 in SciFlow-0.8.0-IRKsT2ba9M716PeGlwt2FT:Control.Workflow.Interpreter.Exec
[ERROR][04-18 15:24] Program exits with errors

-rw-r----- 1 dimitras 0 Apr 18 15:10 output/GENOME/HG19.index

After giving it exec permission, it FINISHED :)

Thank you so much for your help troubleshooting this!

Can I run it to go on through all the steps till the end? Or better run it from the beginning?

kaizhang commented 2 years ago

just run taiji run --config your_config.yml normally.

dimitras commented 2 years ago

Hi Kai, thank you so much for your help. I haven't posted yet because I didn't manage to run it successfully to the end yet. I was in the middle of troubleshooting but had to pause for a few days. I'll post again if I don't manage to resolve it. Thank you again!