XiaoTaoWang / HiC_pipeline

An easy-to-use Hi-C data processing software supporting distributed computation.
http://xiaotaowang.github.io/HiC_pipeline/index.html
GNU General Public License v3.0

Question regarding META data file #8

Closed Lattesnow closed 1 year ago

Lattesnow commented 2 years ago

"$ cd workspace
$ vim datasets.tsv

Create a TXT file called “datasets.tsv” by vim and fill in the following content:

SRR027956 GM06990 R1 HindIII
SRR027958 GM06990 R2 HindIII

The meta data file should contain 4 columns: prefix of the SRA file name (in the case of the FASTQ read format, it should be the leading part of the file name apart from the “_1.fastq” or “_2.fastq” substring), cell line name, biological replicate label, and the restriction enzyme name."

Hey XiaoTao, I have one question regarding this meta data file. Does the biological replicate label refer to the paired-end reads, i.e. the R1.fastq and R2.fastq files? Thanks

XiaoTaoWang commented 2 years ago

No, the biological replicate label is used to guide runHiC to merge contact pairs from multiple SRA/FASTQ files that are from the same experiment. So the "R1" for this column represents "replicate 1", similarly, "R2" represents "replicate 2", and so forth.
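To make the role of the replicate column concrete, here is a minimal Python sketch that reads the meta data file and groups file prefixes by cell line, replicate label, and enzyme. It is only an illustration of the 4-column layout described above, not runHiC's own code; the function name `group_by_replicate` is hypothetical, and splitting on any whitespace is an assumption.

```python
from collections import defaultdict


def group_by_replicate(metadata_path):
    """Group file prefixes by (cell line, replicate label, enzyme).

    Hypothetical helper for illustration only; runHiC has its own parser.
    Assumes columns are separated by spaces or tabs.
    """
    groups = defaultdict(list)
    with open(metadata_path) as handle:
        for line_number, line in enumerate(handle, start=1):
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            if len(fields) != 4:
                raise ValueError(
                    f"line {line_number}: expected 4 columns, got {len(fields)}"
                )
            prefix, cell_line, replicate, enzyme = fields
            groups[(cell_line, replicate, enzyme)].append(prefix)
    return dict(groups)
```

With the two-line example file from the documentation, SRR027956 ends up under replicate R1 and SRR027958 under replicate R2 of GM06990. Several prefixes that share the same label would land in the same group.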

Sorry for the confusion, but if your FASTQ files are suffixed with "R1.fastq" and "R2.fastq", please make sure to rename them so that they end with "_1.fastq" and "_2.fastq" before you run runHiC.
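If there are many files, the renaming can be scripted. The following is a rough Python sketch, not part of runHiC; the helper names are hypothetical, and it assumes the mate suffix is separated from the rest of the name by "_" or "." (for example sample_R1.fastq) and that the files are uncompressed.

```python
import re
from pathlib import Path

# Assumption: names look like "<prefix>_R1.fastq" or "<prefix>.R1.fastq".
_MATE_SUFFIX = re.compile(r"[._]R([12])\.fastq$")


def runhic_style_name(filename):
    """Rewrite a trailing R1.fastq / R2.fastq to _1.fastq / _2.fastq."""
    return _MATE_SUFFIX.sub(r"_\1.fastq", filename)


def rename_fastq_files(directory, dry_run=True):
    """Rename matching FASTQ files in `directory`.

    Returns a list of (old name, new name) pairs. With dry_run=True
    nothing is changed on disk, so the plan can be checked first.
    """
    renamed = []
    for path in sorted(Path(directory).glob("*.fastq")):
        new_name = runhic_style_name(path.name)
        if new_name == path.name:
            continue  # already in the expected form
        target = path.with_name(new_name)
        if target.exists():
            raise FileExistsError(f"{target} already exists")
        if not dry_run:
            path.rename(target)
        renamed.append((path.name, new_name))
    return renamed
```

Calling `rename_fastq_files("path/to/fastq")` first prints nothing and changes nothing; it only returns the planned renames. Pass `dry_run=False` once the plan looks right.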

Let me know if you have further questions.

Best, Xiaotao

Lattesnow commented 2 years ago

Hey Xiaotao, thanks for making it clear. After I ran the test data set, I did not get any BAM file in the SRR027958 folder under alignments-hg19 and pairs-hg19. Output attached below:

(runHiC) snow@DESKTOP-4EEJKVN:/mnt/d/07082022/runhic/workspace$ runHiC mapping -p /mnt/d/07082022/runhic/data/ -g hg19 -f HiC-SRA -F SRA -A bwa-mem -t 10 --include-readid --drop-seq --chunkSize 1500000 --logFile runHiC-mapping.log
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
Read 6443641 spots for /mnt/d/07082022/runhic/data/HiC-SRA/SRR027956.sra
Written 6443641 spots for /mnt/d/07082022/runhic/data/HiC-SRA/SRR027956.sra
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
/home/snow/anaconda3/envs/runHiC/bin/pigz
Read 7889352 spots for /mnt/d/07082022/runhic/data/HiC-SRA/SRR027958.sra
Written 7889352 spots for /mnt/d/07082022/runhic/data/HiC-SRA/SRR027958.sra

root main @ 07/08/22 12:57:38: Chromosome sizes are not provided, attempt to fetch from fasta reference genome ...
root main @ 07/08/22 12:58:02: Done
root main @ 07/08/22 12:58:09: You didn't specify the Genome Index Path. Try to find it under /mnt/d/07082022/runhic/data/hg19
root main @ 07/08/22 12:58:09: Set --Index to /mnt/d/07082022/runhic/data/hg19/hg19.fa
root main @ 07/08/22 12:58:09: Alignment results will be outputed under alignments-hg19
root main @ 07/08/22 12:58:09: Original alignments will be parsed into .pairs format under pairs-hg19
root main @ 07/08/22 12:58:09: Dump/chunk read pairs from sra format ...
root main @ 07/08/22 12:58:09: Current: SRR027956
root main @ 07/08/22 12:58:09: SRR027956: Split raw SRA into chunks ...
root main @ 07/08/22 14:47:16: SRR027956: Done
root main @ 07/08/22 14:47:16: Current: SRR027958
root main @ 07/08/22 14:47:16: SRR027958: Split raw SRA into chunks ...
root main @ 07/08/22 14:48:02: SRR027958: Done
root main @ 07/08/22 14:48:02: Map read pairs to hg19 ...
root main @ 07/08/22 14:48:02: Current: SRR027956
root main @ 07/08/22 14:48:02: Completed work, skip
root main @ 07/08/22 14:48:02: Current: SRR027958
root main @ 07/08/22 14:48:02: Completed work, skip

All the folders created under alignments-hg19 and pairs-hg19 are empty. Is there something I missed in the quick start guide? Thanks

XiaoTaoWang commented 2 years ago

This is strange to me. Can you show me the output of the following commands? Thanks!

  1. ls -lh /mnt/d/07082022/runhic/data/hg19/
  2. ls -lh alignments-hg19/
  3. ls -lh pairs-hg19

Lattesnow commented 2 years ago

Attached below:

(runHiC) snow@DESKTOP-4EEJKVN:/mnt/d/07082022/runhic/data/hg19$ ls -lh
total 8.1G
-rwxrwxrwx 1 snow snow 2.0K Jul 8 12:58 hg19.chrom.sizes
-rwxrwxrwx 1 snow snow 3.0G Aug 21 2018 hg19.fa
-rwxrwxrwx 1 snow snow 8.4K Jul 30 2021 hg19.fa.amb
-rwxrwxrwx 1 snow snow 4.0K Jul 30 2021 hg19.fa.ann
-rwxrwxrwx 1 snow snow 3.0G Jul 30 2021 hg19.fa.bwt
-rwxrwxrwx 1 snow snow 748M Jul 30 2021 hg19.fa.pac
-rwxrwxrwx 1 snow snow 1.5G Jul 30 2021 hg19.fa.sa
-rwxrwxrwx 1 snow snow 5 Aug 27 2021 hg19.fa_Arima.txt

(runHiC) snow@DESKTOP-4EEJKVN:/mnt/d/07082022/runhic/workspace/alignments-hg19$ ls -lh
total 0
drwxrwxrwx 1 snow snow 4.0K Jul 8 12:47 SRR027956
drwxrwxrwx 1 snow snow 4.0K Jul 8 12:47 SRR027958

(runHiC) snow@DESKTOP-4EEJKVN:/mnt/d/07082022/runhic/workspace/pairs-hg19$ ls -lh
total 0
drwxrwxrwx 1 snow snow 4.0K Jul 8 12:47 SRR027956
-rwxrwxrwx 1 snow snow 0 Jul 8 12:47 SRR027956.completed
drwxrwxrwx 1 snow snow 4.0K Jul 8 12:47 SRR027958
-rwxrwxrwx 1 snow snow 0 Jul 8 12:47 SRR027958.completed

XiaoTaoWang commented 2 years ago

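It seems the time when these files were created (Jul 8 12:47) is inconsistent with the time in your log information, which only starts at 12:57. They look like leftovers from an earlier run, which would explain the "Completed work, skip" messages. Could you delete all the content within the alignments-hg19 and pairs-hg19 folders, and re-run the same command?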
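For anyone who prefers to script the cleanup, here is a small Python sketch that empties a folder while keeping the folder itself. It is a generic helper, not something runHiC provides; the function name `clear_directory_contents` is hypothetical. Double-check the paths before running it, since the deletion cannot be undone.

```python
import shutil
from pathlib import Path


def clear_directory_contents(directory):
    """Delete everything inside `directory` but keep the directory itself.

    Returns the sorted names of the removed entries.
    """
    removed = []
    for entry in sorted(Path(directory).iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # sub-folder such as SRR027956/
        else:
            entry.unlink()  # plain file or symlink such as SRR027956.completed
        removed.append(entry.name)
    return removed
```

Run from the workspace folder, `clear_directory_contents("alignments-hg19")` followed by `clear_directory_contents("pairs-hg19")` would remove the stale sub-folders and ".completed" marker files listed above.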

Lattesnow commented 2 years ago

Yeah, that solved my problem. Thanks

Lattesnow commented 2 years ago

Hi Xiaotao, I am still confused about the meta data file. I only have one sequencing run from the Arima-HiC kit, with two FASTQ files (paired-end reads) labeled _1.fastq and _2.fastq. Since they are not replicates, what should I put in the meta data file? Thanks,

XiaoTaoWang commented 2 years ago

Suppose your FASTQ files are test_1.fastq and test_2.fastq, just make a one-line meta data file as follows:

test    your_sample_name    R1    Arima
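A quick way to confirm that the prefix in the first column matches the files on disk is sketched below in Python. This is only an illustration, not part of runHiC; the function name `missing_mates` is hypothetical, and it assumes uncompressed files named exactly `<prefix>_1.fastq` and `<prefix>_2.fastq`.

```python
from pathlib import Path


def missing_mates(metadata_path, fastq_dir):
    """List the expected mate files that are absent from `fastq_dir`.

    For every prefix in the first column of the meta data file, both
    <prefix>_1.fastq and <prefix>_2.fastq are expected (an assumption;
    compressed files are not considered here).
    """
    missing = []
    for line in Path(metadata_path).read_text().splitlines():
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        prefix = fields[0]
        for mate in ("_1.fastq", "_2.fastq"):
            candidate = Path(fastq_dir) / f"{prefix}{mate}"
            if not candidate.exists():
                missing.append(candidate.name)
    return missing
```

An empty list means every prefix has both mates; anything else names the files that need to be added or renamed.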

Lattesnow commented 2 years ago

Thanks