OpenGene / AfterQC

Automatic Filtering, Trimming, Error Removing and Quality Control for fastq data
MIT License

Specify output folder name #11

Open alezanalp opened 7 years ago

alezanalp commented 7 years ago

Is it possible to add an option for specifying the name of the output folder that holds the report files, rather than deriving it from the input file names?

Yours faithfully, Katerina

sfchen commented 7 years ago

AfterQC is designed to run in batch mode. So, normally, AfterQC will create a QC folder, and within the QC folder there will be one folder for each input fastq file.

You can change the name QC to report by specifying -r report in the command line.

Then the dir tree will be like:

report/
└── R1.fq
    ├── report.html
    └── report.json

So, your requirement is to drop the 'R1.fq' folder inside the report folder, and make the dir tree look like this:

report/
├── report.html
└── report.json

Am I right?

alezanalp commented 7 years ago

Yes, so that the user can manually specify the "report" folder for each pair, for example when running in -1 -2 mode.

serge2016 commented 7 years ago

I think it would be perfect if the user could specify a prefix for the reports, e.g. --report-prefix=/path/to/dir/filename, and then get:

/path/to/dir/
├── filename.html
└── filename.json

sfchen commented 7 years ago

@alezanalp, do you agree with @serge2016?

sfchen commented 7 years ago

I have submitted a commit to implement @serge2016's idea. You can pull or download the latest master to try it.

Now, you will get

QC
├── filename1.fq.html
├── filename1.fq.json
├── filename2.fq.html
└── filename2.fq.json
...

You can change the folder name from QC to report by specifying -r report. You can also specify an absolute path with -r /path/to/dir/
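For illustration, the layout described above can be sketched in Python. This is only a sketch based on the description in this comment, not AfterQC's actual code; the helper name expected_report_paths is made up here, and the rule "report name = input file base name plus .html/.json" is an assumption read off the tree above.

```python
import os

def expected_report_paths(read1_file, report_folder="QC"):
    """Return the (html, json) report paths expected for one input file.

    Assumption: reports are named after the input file's base name, as in
    the tree above (filename1.fq -> filename1.fq.html / filename1.fq.json),
    and are placed directly inside the folder given by -r (default: QC).
    """
    base = os.path.basename(read1_file)
    return (
        os.path.join(report_folder, base + ".html"),
        os.path.join(report_folder, base + ".json"),
    )

# Default folder name.
print(expected_report_paths("filename1.fq"))
# Folder changed with -r, here to an absolute path.
print(expected_report_paths("data/filename1.fq", "/path/to/dir/"))
```

Only the directory part of the input path is dropped; the full file name, including its extension, is kept in the report name.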

alezanalp commented 7 years ago

@sfchen Yes, I agree with @serge2016. Thank you for the prompt reply. I will try it.

serge2016 commented 7 years ago

There is one more "issue" or bug with this in v0.9.0. If I run after.py --read1_file=SRR3184279_1.fastq.gz --read2_file=SRR3184279_2.fastq.gz --read1_flag=_1 --read2_flag=_2 --qc_only then everything is OK:

$(pwd)/QC/
└── SRR3184279_1.fastq.gz
    ├── report.html
    └── report.json

But if I run after.py --read1_file=SRR3184279_1.fastq.gz --read2_file=SRR3184279_2.fastq.gz --read1_flag=_1 --read2_flag=_2 --qc_only --report_output_folder=$(pwd) then I get:

SRR3184279_1.fastq.gz options:
{'qc_only': True, 'version': '0.9.0', 'seq_len_req': 35, 'index1_file': None, 'trim_tail': 0, 'report_output_folder': '/home/bg/kate/AfterQC/PE_reads/', 'trim_pair_same': True, 'no_correction': False, 'debubble_dir': 'debubble', 'barcode_flag': 'barcode', 'read2_file': 'SRR3184279_2.fastq.gz', 'barcode_length': 12, 'trim_tail2': 0, 'unqualified_base_limit': 60, 'allow_mismatch_in_poly': 2, 'read2_flag': '_2', 'store_overlap': False, 'debubble': False, 'read1_flag': '_1', 'index2_flag': 'I2', 'draw': True, 'index1_flag': 'I1', 'mask_mismatch': False, 'barcode': False, 'overlap_output_folder': None, 'barcode_verify': 'CAGTA', 'index2_file': None, 'qualified_quality_phred': 15, 'trim_front': 9, 'good_output_folder': 'good', 'poly_size_limit': 35, 'n_base_limit': 5, 'qc_sample': 200000, 'trim_front2': 9, 'no_overlap': False, 'input_dir': None, 'read1_file': 'SRR3184279_1.fastq.gz', 'qc_kmer': 8, 'bad_output_folder': None}

Traceback (most recent call last):
  File "/home/bg/soft/AfterQC-0.9.0/after.py", line 221, in <module>
    main()
  File "/home/bg/soft/AfterQC-0.9.0/after.py", line 215, in main
    processOptions(options)
  File "/home/bg/soft/AfterQC-0.9.0/after.py", line 171, in processOptions
    filter.run()
  File "/home/bg/soft/AfterQC-0.9.0/preprocesser.py", line 709, in run
    stat_file = open(os.path.join(qc_dir, "report.json"), "w")
IOError: [Errno 20] Not a directory: '/home/bg/kate/AfterQC/PE_reads/SRR3184279_1.fastq.gz/report.json'

This error occurs if I set the -r directory to the directory from which I run AfterQC.

sfchen commented 7 years ago

@serge2016 this happens because v0.9.0 needs to create a folder with the same name as the R1 fastq file, so the folder conflicts with the fastq file itself if $(pwd) is specified as report_output_folder.

I believe this issue is gone as of the last commit.
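The name clash can be reproduced outside AfterQC with a few lines of Python. This is a minimal sketch that only mimics what the traceback shows (report.json being opened inside a folder named after the read1 file); the function try_write_report is hypothetical and is not taken from preprocesser.py.

```python
import os
import tempfile

def try_write_report(report_output_folder, read1_file):
    """Write report.json inside a folder named after the read1 file.

    Mimics the v0.9.0 behaviour visible in the traceback. Returns None on
    success, or the exception raised while creating or writing the report.
    """
    qc_dir = os.path.join(report_output_folder, os.path.basename(read1_file))
    try:
        if not os.path.exists(qc_dir):
            os.makedirs(qc_dir)
        with open(os.path.join(qc_dir, "report.json"), "w") as stat_file:
            stat_file.write("{}")
        return None
    except (IOError, OSError) as err:
        return err

with tempfile.TemporaryDirectory() as workdir:
    fastq = os.path.join(workdir, "SRR3184279_1.fastq.gz")
    open(fastq, "w").close()  # empty stand-in for the real input file

    # Report folder == input folder: qc_dir is the path of the fastq file.
    clash = try_write_report(workdir, fastq)
    # Separate report folder: qc_dir is a fresh directory, no collision.
    ok = try_write_report(os.path.join(workdir, "QC"), fastq)
    print(repr(clash), ok)
```

In the first call the would-be folder path already exists as a regular file, so opening report.json beneath it fails; in the second call the folder is created normally.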

sfchen commented 7 years ago

I just released v0.9.1. You can try the new feature described above.

serge2016 commented 7 years ago

Now the previous behavior has changed to something more predictable :) Thank you! But I am still thinking about the case where we specify the -1 and -2 options: in this mode we have only one sample, so we could specify the full output name for the report.

I simply want to use your tool inside a CWL environment, and that is easier to do if the output filenames can be specified independently of the input filenames.
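Until such an option exists, one possible workaround for a wrapper is to rename the reports after the run. The sketch below assumes the v0.9.1 layout described earlier in this thread (the report folder contains a .html and a .json file named after the read1 file); the function rename_reports and its output_prefix argument are hypothetical and are not part of AfterQC.

```python
import os
import shutil
import tempfile

def rename_reports(report_folder, read1_file, output_prefix):
    """Move <report_folder>/<read1 name>.{html,json} to <output_prefix>.{html,json}.

    Assumption: reports are named after the read1 file's base name.
    Returns the list of new report paths.
    """
    base = os.path.basename(read1_file)
    moved = []
    for ext in (".html", ".json"):
        src = os.path.join(report_folder, base + ext)
        dst = output_prefix + ext
        dst_dir = os.path.dirname(dst)
        if dst_dir and not os.path.isdir(dst_dir):
            os.makedirs(dst_dir)
        shutil.move(src, dst)
        moved.append(dst)
    return moved

# Demo with empty placeholder files instead of real AfterQC reports.
with tempfile.TemporaryDirectory() as tmp:
    qc = os.path.join(tmp, "QC")
    os.makedirs(qc)
    for ext in (".html", ".json"):
        open(os.path.join(qc, "SRR3184279_1.fastq.gz" + ext), "w").close()

    new_paths = rename_reports(
        qc, "SRR3184279_1.fastq.gz", os.path.join(tmp, "out", "sample")
    )
    all_exist = all(os.path.exists(p) for p in new_paths)
    names = [os.path.basename(p) for p in new_paths]
    print(names)
```

This gives the wrapper fixed output names to declare, at the cost of an extra step after each AfterQC run.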