ENCODE-DCC / atac-seq-pipeline

ENCODE ATAC-seq pipeline
MIT License

Build genome database for your own genome and #414

Open Rafaelsoler13 opened 1 year ago

Rafaelsoler13 commented 1 year ago

Hello,

I am trying to run the pipeline on chicken samples and have tried to create a custom genome reference for it. However, after finishing the steps in build_genome_database.md [https://github.com/ENCODE-DCC/atac-seq-pipeline/blob/master/docs/build_genome_database.md], the build fails to create the tss file, reg2map, etc., so they are missing from the tsv file I get. Are these files needed to run the pipeline? If so, how can I get them? The documentation linked above does not say.

Also, when I run the pipeline with the json file I generated, it gives me these errors in the alignment step:

~/ENCODE_workflow/atac-seq-pipeline-master/atac/6e18ee31-db88-4de9-b841-2b2f40910291/metadata.json
2023-04-28 16:00:02,728|caper.cromwell|INFO| Workflow failed. Auto-troubleshooting...
* Started troubleshooting workflow: id=6e18ee31-db88-4de9-b841-2b2f40910291, status=Failed
* Found failures JSON object.
[
    {
        "causedBy": [
            {
                "message": "Job atac.align:0:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
                "causedBy": []
            },
            {
                "message": "Job atac.align_mito:0:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
                "causedBy": []
            },
            {
                "message": "Job atac.align:1:2 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.",
                "causedBy": []
            }
        ],
        "message": "Workflow failed"
    }
]
* Recursively finding failures in calls (tasks)...

==== NAME=atac.align_mito, STATUS=RetryableFailure, PARENT=
SHARD_IDX=0, RC=1, JOB_ID=309275
START=2023-04-28T13:47:31.248Z, END=2023-04-28T13:50:22.132Z
STDOUT=~/ENCODE_workflow/atac-seq-pipeline-master/atac/6e18ee31-db88-4de9-b841-2b2f40910291/call-align_mito/shard-0/execution/stdout
STDERR=~/ENCODE_workflow/atac-seq-pipeline-master/atac/6e18ee31-db88-4de9-b841-2b2f40910291/call-align_mito/shard-0/execution/stderr
STDERR_CONTENTS=
Traceback (most recent call last):
  File "/software/atac-seq-pipeline/src/encode_task_bowtie2.py", line 192, in <module>
    main()
  File "/software/atac-seq-pipeline/src/encode_task_bowtie2.py", line 169, in main
    args.out_dir)
  File "/software/atac-seq-pipeline/src/encode_task_bowtie2.py", line 102, in bowtie2_pe
    tmp_bam=tmp_bam,
  File "/software/atac-seq-pipeline/src/encode_lib_common.py", line 359, in run_shell_cmd
    raise Exception(err_str)
Exception: PID=39, PGID=39, RC=127, DURATION_SEC=0.0
STDERR=/bin/bash: line 1: -1: command not found

This is the json file:

{
    "atac.title" : "Chicken_test_atac_ENCODE",
    "atac.description" : "Test performed to validate the ENCODE pipeline in Chicken",

    "atac.pipeline_type" : "atac",
    "atac.align_only" : false,
    "atac.true_rep_only" : false,

    "atac.genome_tsv" : "/media/victor/disco1/ATAC_non_canonical_species/ENCODE_test_files/chicken_GRCg7b.tsv",

    "atac.paired_end" : true,

    "atac.fastqs_rep1_R1" : [ "~/ATAC_non_canonical_species/raw_data/SRR19213758_1.fastq.gz" ],
    "atac.fastqs_rep1_R2" : [ "~/ATAC_non_canonical_species/raw_data/SRR19213758_2.fastq.gz" ],
    "atac.fastqs_rep2_R1" : [ "~/ATAC_non_canonical_species/raw_data/SRR19213759_1.fastq.gz" ],
    "atac.fastqs_rep2_R2" : [ "~/ATAC_non_canonical_species/raw_data/SRR19213759_2.fastq.gz" ],

    "atac.auto_detect_adapter" : false,

    "atac.multimapping" : 8
}

And this is the tsv file:

ref_fa | ~/ATAC_non_canonical_species/ENCODE_test_files/chicken_GRCg7b.gz
ref_mito_fa | ~/ATAC_non_canonical_species/ENCODE_test_files/chicken_GRCg7b.chrM.fa.gz
mito_chr_name | chrM
regex_bfilt_peak_chr_name | chr[\dWZ]+
chrsz | ~/ATAC_non_canonical_species/ENCODE_test_files/chicken_GRCg7b.chrom.sizes
gensz | 1053332251
bowtie2_idx_tar | ~/ATAC_non_canonical_species/ENCODE_test_files/bowtie2_index/chicken_GRCg7b.tar.gz
bowtie2_mito_idx_tar | ~/ATAC_non_canonical_species/ENCODE_test_files/bowtie2_index/chicken_GRCg7b.chrM.fa.tar.gz

Best,

Rafael

sufyazi commented 1 year ago

Hi there,

The error message says `STDERR=/bin/bash: line 1: -1: command not found`, so I wonder if this is just an issue of dependencies not being installed. Can you double-check what line 1 is referring to here?
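For reference, bash reports `-1: command not found` with return code 127 whenever it is asked to run a line that begins with `-1`. The sketch below reproduces that message with a made-up command in which a line break lands just before the read options. The command text, index prefix, and file names are hypothetical, and whether this is what happened inside the pipeline is not confirmed anywhere in this thread.

```python
import subprocess

# Hypothetical command text (not taken from the pipeline): a stray line
# break separates the aligner call from its "-1 ... -2 ..." read options.
broken_cmd = "echo aligner -x index_prefix\n-1 reads_1.fq -2 reads_2.fq"

# bash runs the first line normally, then treats "-1" on the second line
# as the name of a command to execute, which it cannot find.
result = subprocess.run(
    ["bash", "-c", broken_cmd],
    capture_output=True,
    text=True,
)

print("return code:", result.returncode)
print("stderr:", result.stderr.strip())
```

If the real command was split this way, the first line would run the aligner without any read inputs and the second line would fail separately, so printing the exact command string the task executed is a reasonable next check.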

Your tsv file looks fine; maybe try using absolute paths (replace ~ with the full path), and double-check that all the files are where the tsv says they are?
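A minimal sketch of that check, assuming the genome tsv is a two-column, tab-separated file of key and value (the ` | ` separators in the post above look like a rendering of those columns). The helper name `check_genome_tsv` and the demo file names are made up for illustration; the keys are the path-valued ones from the tsv in this thread.

```python
import csv
import os
import tempfile

# Keys from the tsv in this thread whose values are file paths.
PATH_KEYS = {
    "ref_fa",
    "ref_mito_fa",
    "chrsz",
    "bowtie2_idx_tar",
    "bowtie2_mito_idx_tar",
}


def check_genome_tsv(tsv_path):
    """Return (resolved, missing).

    resolved maps each path-valued key to an absolute path with "~" expanded;
    missing lists the keys whose file does not exist on disk.
    """
    resolved, missing = {}, []
    with open(tsv_path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) < 2 or row[0] not in PATH_KEYS:
                continue  # skip non-path entries such as gensz
            key, value = row[0], row[1]
            full = os.path.abspath(os.path.expanduser(value))
            resolved[key] = full
            if not os.path.isfile(full):
                missing.append(key)
    return resolved, missing


# Demo on a throwaway tsv: one file that exists, one that does not.
with tempfile.TemporaryDirectory() as tmp:
    sizes = os.path.join(tmp, "demo.chrom.sizes")
    open(sizes, "w").close()
    tsv = os.path.join(tmp, "demo.tsv")
    with open(tsv, "w") as handle:
        handle.write(f"chrsz\t{sizes}\n")
        handle.write("ref_fa\t~/no_such_dir_for_demo/genome.fa.gz\n")
        handle.write("gensz\t1053332251\n")
    demo_resolved, demo_missing = check_genome_tsv(tsv)

print("resolved:", demo_resolved)
print("missing:", demo_missing)
```

Pointing `check_genome_tsv` at the real tsv lists any entry whose file is absent and gives the expanded paths to paste back into the tsv.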

Rafaelsoler13 commented 1 year ago

I used absolute paths to run it but it still does not work. The error is with Bowtie2:

***
Error: Must specify at least one read input with -U/-1/-2
(ERR): bowtie2-align exited with value 1
STDOUT=

But I am specifying the fastq files correctly:

    "atac.fastqs_rep1_R1" : [ "/media/analysis/ATAC_non_canonical_species/raw_data/SRR19213758_1.fastq.gz" ],
    "atac.fastqs_rep1_R2" : [ "/media/analysis/ATAC_non_canonical_species/raw_data/SRR19213758_2.fastq.gz" ],
    "atac.fastqs_rep2_R1" : [ "/media/analysis/ATAC_non_canonical_species/raw_data/SRR19213759_1.fastq.gz" ],
    "atac.fastqs_rep2_R2" : [ "/media/analysis/ATAC_non_canonical_species/raw_data/SRR19213759_2.fastq.gz" ],

Rafaelsoler13 commented 1 year ago

I tried aligning the samples with Bowtie2 directly on my PC, and it actually works:

bowtie2 --very-sensitive -p 8 -X 2000 -x chicken_bowtie -1 raw_data/SRR19213758_1.fastq.gz -2 raw_data/SRR19213758_2.fastq.gz -S SRR19213758.sam
  11553902 (100.00%) were paired; of these:
    2217557 (19.19%) aligned concordantly 0 times
    8884460 (76.90%) aligned concordantly exactly 1 time
    451885 (3.91%) aligned concordantly >1 times
    ----
    2217557 pairs aligned concordantly 0 times; of these:
      820611 (37.01%) aligned discordantly 1 time
    ----
    1396946 pairs aligned 0 times concordantly or discordantly; of these:
      2793892 mates make up the pairs; of these:
        2499890 (89.48%) aligned 0 times
        215809 (7.72%) aligned exactly 1 time
        78193 (2.80%) aligned >1 times
89.18% overall alignment rate

What could be happening?

sufyazi commented 1 year ago

Interesting. As I am not a developer, I don't think I can help beyond this. Something else must be wrong, either in the custom genome build or in your installation. Have you tried running a test sample with the default human genome? Consider trying that first to rule out a bad installation.

Rafaelsoler13 commented 1 year ago

Yes! Actually, the tutorial ran without a problem!

leepc12 commented 1 year ago

Sorry for the late response. You don't need those files (tss file, reg2map, ...). They are extra data for some additional analyses in the pipeline. Disable the analyses that use those data by adding the following to your input JSON:

{
  "atac.enable_tss_enrich" : false,
  "atac.enable_annot_enrich" : false,
  "atac.enable_compare_to_roadmap" : false,
  "atac.enable_gc_bias" : false
}
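The four keys above go inside the same top-level object as the existing `atac.*` entries, not in a second object. A small sketch of merging them programmatically; the function name `disable_optional_analyses` and the shortened example input are illustrative assumptions, while the flag names are copied from the snippet above.

```python
import json

# Flags copied from the snippet above; they switch off the analyses that
# need the extra genome data (tss file, reg2map, ...).
DISABLE_FLAGS = {
    "atac.enable_tss_enrich": False,
    "atac.enable_annot_enrich": False,
    "atac.enable_compare_to_roadmap": False,
    "atac.enable_gc_bias": False,
}


def disable_optional_analyses(input_json_text):
    """Return the input JSON text with the four flags merged in."""
    config = json.loads(input_json_text)
    config.update(DISABLE_FLAGS)  # existing keys are kept, flags are added
    return json.dumps(config, indent=4)


# Shortened stand-in for the input JSON posted earlier in the thread.
example = '{"atac.title": "Chicken_test_atac_ENCODE", "atac.paired_end": true}'
merged = disable_optional_analyses(example)
print(merged)
```

Editing the file by hand works just as well; the point is that the result is a single JSON object containing both the original settings and the four flags.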

How did you run Caper? It looks like it ran inside a Docker container. What is the exact command line you used to run Caper, e.g. `caper run atac.wdl -i input.json --docker`?