AG-Boerries / CAST-Seq

CAST-Seq Bioinformatic pipeline
GNU Affero General Public License v3.0

where should I change if I do cast-seq in zebrafish ? #3

Closed panxiaoguang closed 9 months ago

panxiaoguang commented 1 year ago

Dear Author,

Could you tell me which scripts in the pipeline I need to modify, and where, in order to use zebrafish as the reference genome?

Best Regards

gandrigit commented 9 months ago

Dear Xiaoguang, sorry for the delay; somehow I did not get the notification from GitHub. Adding zebrafish as a reference genome is possible. You need to create a bowtie2 index for zebrafish and replace the annotation R packages BSgenome.Hsapiens.UCSC.hg38 and TxDb.Hsapiens.UCSC.hg38.knownGene with their zebrafish equivalents (these should be BSgenome.Drerio.UCSC.danRer11 and TxDb.Drerio.UCSC.danRer11.refGene). If you need help with that, please let me know.
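
A minimal sketch of the package swap described above, assuming the species is chosen by a simple lookup. The names `species_packages` and `get_annotation_packages` are hypothetical and are not taken from CAST-Seq.R; only the four package names come from this thread.

```r
# Hypothetical lookup table mapping a species label to its annotation packages.
# The helper names are illustrative and not part of CAST-Seq.R.
species_packages <- list(
  human = c(bsgenome = "BSgenome.Hsapiens.UCSC.hg38",
            txdb     = "TxDb.Hsapiens.UCSC.hg38.knownGene"),
  zebrafish = c(bsgenome = "BSgenome.Drerio.UCSC.danRer11",
                txdb     = "TxDb.Drerio.UCSC.danRer11.refGene")
)

get_annotation_packages <- function(species) {
  if (!species %in% names(species_packages)) {
    stop("Unsupported species: ", species)
  }
  species_packages[[species]]
}

pkgs <- get_annotation_packages("zebrafish")

# Report which packages are not installed yet; nothing is installed here.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not_installed <- pkgs[!installed]
if (length(not_installed) > 0) {
  message("Install from Bioconductor first: ",
          paste(not_installed, collapse = ", "))
}
```

The missing packages would typically be installed with `BiocManager::install()` before running the pipeline.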

panxiaoguang commented 9 months ago

Thanks a lot. Do I need to modify any code in your scripts, or could you help with that if you have time? I saw that you added human and mouse support in the SPECIES PARAMETERS section of CAST-Seq.R. Is it enough to modify just that part if I only want to run my data in CAST-Seq mode?

gandrigit commented 9 months ago

I will update the script CAST-Seq.R accordingly. I will let you know ASAP.

gandrigit commented 9 months ago

Please check the CAST-Seq.R update (from line 902). You still have to add the files to the annotation directory. You have to provide:

  • bowtie2 index
  • chrom.sizes file
  • dr11_TSS_TES.txt

You can check both the human and mouse annotation folders to see what these files are.

panxiaoguang commented 9 months ago

Oh, thank you so much! I will try it.

panxiaoguang commented 9 months ago

Sorry to bother you again. How did you generate "hg38_TSS_TES.txt"? I took the first five genes and looked up their start and stop positions online. For some genes the TSS and TES match the gene location, but for others they do not, which is confusing. (PS: the gene locations also differ between databases such as GeneCards, UCSC and Ensembl.) In addition, should we use txStart and txEnd as the TSS and TES?

gandrigit commented 9 months ago

For human and mouse, we use biomaRt to retrieve this information from the Entrez gene IDs (entrez):

results <- getBM(attributes = c("entrezgene_id", "chromosome_name", "start_position", "end_position", "strand"),
        filters = c("entrezgene_id"),
        values = entrez, mart = ensembl)

So TSS and TES represent start_position and end_position, respectively. This information is used to annotate the sites at the end. You can always run the pipeline and check the sites manually against different resources.
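
As a rough sketch of how such a query result could be turned into a TSS/TES table for another species: the example below uses a toy data frame in place of a real `getBM()` result (the values are made up), and the output column names, the "chr" prefix and the file location are assumptions. Compare against the files in the human and mouse annotation folders before relying on this layout.

```r
# Toy stand-in for the data frame returned by getBM(); the values are made up.
results <- data.frame(
  entrezgene_id   = c(1L, 2L),
  chromosome_name = c("1", "2"),
  start_position  = c(1000L, 5000L),
  end_position    = c(2000L, 9000L),
  strand          = c(1L, -1L)
)

# Map the gene boundaries as described above:
# TSS = start_position, TES = end_position.
# Column names and the "chr" prefix are assumptions, not the pipeline's format.
tss_tes <- data.frame(
  entrez = results$entrezgene_id,
  chr    = paste0("chr", results$chromosome_name),
  TSS    = results$start_position,
  TES    = results$end_position,
  strand = ifelse(results$strand == 1, "+", "-")
)

# Hypothetical output location; written to a temporary directory here.
out_file <- file.path(tempdir(), "dr11_TSS_TES.txt")
write.table(tss_tes, out_file, sep = "\t", quote = FALSE, row.names = FALSE)
```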

panxiaoguang commented 9 months ago

Thank you so much, I will try it. In addition, while looking at the additional files, I was wondering what the start and end positions in the file "ots.bed" mean. It seems to be a region spanning from "predicted cutting site - 16 bp" to "predicted cutting site + 14 bp". When I compare the sequence from "pos.fa" with "gRNA.fa", your predicted cutting site appears to be N18. I would like to know whether my interpretation is correct. Thank you again for your help.

panxiaoguang commented 9 months ago

Hi, I tried the function like this:

library(biomaRt)
# useEnsembl() accepts the mart name directly; useDataset() expects an existing Mart object
ensembl <- useEnsembl(biomart = "genes", dataset = "hsapiens_gene_ensembl")
attributes <- listAttributes(ensembl)
attributes[9:12,]

and the result was:

              name              description         page
9  chromosome_name Chromosome/scaffold name feature_page
10  start_position          Gene start (bp) feature_page
11    end_position            Gene end (bp) feature_page
12          strand                   Strand feature_page

So the start and end positions here are actually the gene start and end positions, not the TSS and TES, and they match the values retrieved directly from the website. The Ensembl version also influences the boundaries, which is why I was confused last time. Thank you very much.
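
If strand-aware coordinates are wanted instead of raw gene boundaries, one common convention is to take the gene start as the TSS on the forward strand and the gene end as the TSS on the reverse strand. The sketch below illustrates that convention only; per the reply above, the pipeline itself uses start_position and end_position directly, and the helper name `strand_aware_tss_tes` is hypothetical.

```r
# Hypothetical helper: derive strand-aware TSS/TES from gene boundaries.
# `strand` follows the biomaRt encoding (1 = forward, -1 = reverse).
strand_aware_tss_tes <- function(start, end, strand) {
  forward <- strand == 1
  data.frame(
    TSS = ifelse(forward, start, end),  # reverse strand: transcription starts at the gene end
    TES = ifelse(forward, end, start)
  )
}

# Toy coordinates (made up): one forward-strand and one reverse-strand gene.
coords <- strand_aware_tss_tes(
  start  = c(1000, 5000),
  end    = c(2000, 9000),
  strand = c(1, -1)
)
print(coords)
```

These are still gene-level boundaries, so they can differ from transcript-level values such as txStart and txEnd.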