Roy-lab / scMTNI

18 stars 3 forks source link

The ’cluster.bam‘ file generated by the ’PrepareBamForLigerclusters.py‘ script does not contain any sequence information. #18

Closed xudongzhango closed 3 weeks ago

xudongzhango commented 1 month ago

When I tried to run the example data, the command ’macs2 callpeak -t ${indir}/${cluster}.bam -f BAM -g mm -n $outdir/${cluster}_peak -B‘ in the ’genPriorNetwork_scMTNI.sh‘ file reported an error. I found that it was because the BAM file produced in the previous step did not contain sequence information. Running macs2 directly with the scATACseq.bam file yields results, so I suspect the issue lies with the ’PrepareBamForLigerclusters.py’ script. I am running the example data; could the problem be due to incomplete example data? I would greatly appreciate your assistance with this matter.Please

sroyyors commented 1 month ago

hi, you might be right that the file is incomplete because we uploaded only a part of the bam file as it is too big. We are looking into this but in the mean time can you try using a full bam file and see if it works?

xudongzhango commented 1 month ago

hi, you might be right that the file is incomplete because we uploaded only a part of the bam file as it is too big. We are looking into this but in the mean time can you try using a full bam file and see if it works?

Thank you very much for your reply.

Yes, I generated the bam file from the original fastq data and re-ran the process. For now, I have been able to run up to the fourth step in genPriorNetwork_scMTNI.sh, generating the {cluster}_network.txt file. During this process, I have a few questions:

1.For testing purposes, I only used one scATAC dataset. The generated bam file does not have 'GMP', "CMP", "mono", or "HSC". In PrepareBamForLigerclusters.py, these labels are not matched. What impact will this have? 2.Can the {cluster}_network.txt generated in the fourth step be used as a preliminary gene regulatory network, or is it necessary to use the network generated by steps 5, 6, and 7? 3.In the third step, for Map peaks to genes, is the bed file 5000 bases upstream and downstream of the gene promoter? chr1 3049232 3059232 TSS_ENSMUST00000160944_Gm16088 -1 + Gm16088 What does the -1 represent in this context? Are there only -1 values?

shiluzhang commented 1 month ago

Thank you for your questions. Below are the answers:

  1. Can you show an example of the bam file you generated without the cell labels? In PrepareBamForLigerclusters.py line 40-47, it is extracting the barcodes for each cell type. So if the cell label is not present in the bam file, the script will not work. See below example of the bam file with cell barcode:

AAACCATC:NS500418:459:H57H2BGXY:1:11101:3400:14119 77 0 0 0 0 CTGTCTCTTATACACATCTCCGAGCCCACGAGACTGGTCACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/E/AEEEEEEEEEEEAEEEEEEEEEEEEEEE66/ YT:Z:UP RG:Z:singles-160822-BM1137-CMP-LS-31 PG:Z:MarkDuplicates-584E59B3 AAACCATC:NS500418:459:H57H2BGXY:1:11101:3400:14119 141 0 0 0 0 CTGTCTCTTATACACATCTGACGCTGCCGACGATGTAGATTGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAA AAAAAEEEEEEEEEEEEEEEAEEEEEEEEAEEEEEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEE6AEEEE YT:Z:UP RG:Z:singles-160822-BM1137-CMP-LS-31 PG:Z:MarkDuplicates-584E59B3

  1. It is necessary to use the network generated by steps 5, 6, and 7. The {cluster}_network.txt generated in the fourth step is too large, which will not help in predicting the final GRN. And scMTNI requires the prior network edge weight to be a percentile (0-1).

  2. The bed file is 5000 bases upstream and downstream of the gene TSS. -1 is the 5th column of the BED file, which is the score column (https://genome.ucsc.edu/FAQ/FAQformat). It is not meaningful here, you can ignore the -1.

xudongzhango commented 1 month ago

Thank you for your questions. Below are the answers:

  1. Can you show an example of the bam file you generated without the cell labels? In PrepareBamForLigerclusters.py line 40-47, it is extracting the barcodes for each cell type. So if the cell label is not present in the bam file, the script will not work. See below example of the bam file with cell barcode:

AAACCATC:NS500418:459:H57H2BGXY:1:11101:3400:14119 77 0 0 0 0 CTGTCTCTTATACACATCTCCGAGCCCACGAGACTGGTCACAATCTCGTATGCCGTCTTCTGCTTGAAAAAAAAAG AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/E/AEEEEEEEEEEEAEEEEEEEEEEEEEEE66/ YT:Z:UP RG:Z:singles-160822-BM1137-CMP-LS-31 PG:Z:MarkDuplicates-584E59B3 AAACCATC:NS500418:459:H57H2BGXY:1:11101:3400:14119 141 0 0 0 0 CTGTCTCTTATACACATCTGACGCTGCCGACGATGTAGATTGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAA AAAAAEEEEEEEEEEEEEEEAEEEEEEEEAEEEEEEEEEEEEEEEEEAEEEEEEEEEEAEEEEEEEEEEE6AEEEE YT:Z:UP RG:Z:singles-160822-BM1137-CMP-LS-31 PG:Z:MarkDuplicates-584E59B3

  1. It is necessary to use the network generated by steps 5, 6, and 7. The {cluster}_network.txt generated in the fourth step is too large, which will not help in predicting the final GRN. And scMTNI requires the prior network edge weight to be a percentile (0-1).
  2. The bed file is 5000 bases upstream and downstream of the gene TSS. -1 is the 5th column of the BED file, which is the score column (https://genome.ucsc.edu/FAQ/FAQformat). It is not meaningful here, you can ignore the -1.

My BAM file was obtained by processing the fastq data from SRR20324838 using cellranger-atac, and its format is as follows: SRR20324838.51617931 99 chr1 3000587 60 42M = 3000587 42 GTATACTCTAGTTTCCTTTTGGAGGCACACAGGCCTGTGAGT ?????????????????????????????????????????? NM:i:0 MD:Z:42 AS:i:42 XS:i:24 CR:Z:TACATTCAGGTGAACC CY:Z:???????????????? CB:Z:TACATTCAGGTGAACC-1 RG:Z:MM3:MissingLibrary:1:HGCH7DRXX:1 TR:Z:CTGTCTCT TQ:Z:???????? SRR20324838.51617931 147 chr1 3000587 60 42M = 3000587 -42 GTATACTCTAGTTTCCTTTTGGAGGCACACAGGCCTGTGAGT ?????????????????????????????????????????? NM:i:0 MD:Z:42 AS:i:42 XS:i:24 CR:Z:TACATTCAGGTGAACC CY:Z:???????????????? CB:Z:TACATTCAGGTGAACC-1 RG:Z:MM3:MissingLibrary:1:HGCH7DRXX:1 TR:Z:CTGTCTC TQ:Z:??????? In the PrepareBamForLigerclusters.py file, I changed barcode=l.query_name.split(':')[0] to barcode=l.get_tag('CB') to obtain my barcodes. However, by doing so, I cannot retrieve 'GMP', "CMP", "mono", or "HSC" because they do not exist in the BAM file.

shiluzhang commented 3 weeks ago

Thank you for providing the example. We are working on preparing and uploading the raw bam file so that you can download the same bam file.

shiluzhang commented 2 weeks ago

We have uploaded the raw scATAC-seq data to zenodo: https://zenodo.org/records/11876980. Link to the file: https://zenodo.org/records/11876980/files/Buenrostro_priorNetwork_bamfiles.tar.gz?download=1