aertslab / pycisTopic

pycisTopic is a Python module to simultaneously identify cell states and cis-regulatory topics from single cell epigenomics data.
Other
58 stars 12 forks source link

get a lower TSS enrichmet #176

Closed GGboy-Zzz closed 1 month ago

GGboy-Zzz commented 1 month ago

Hello, I met a problem in pycistopic pipeline, When I did QC for fragment file, I got a very low TSS enrichment score(less than 2) with other metrics normal. And I successfully ran all the step of pycistopic and pycistarget using the passing-RNA QC cells, I lost information on ATACseq QC (yet I get a normal TSS enrichment score in ArchR pipeline). I dont know which step caused this problem, and I hope you can give me some suggestion. Thanks for your reply! image

ghuls commented 1 month ago

Is your TSS BED file for the correct species and the correct assembly version?

What is the "normal" TSS enrichment score you get in ArchR?

GGboy-Zzz commented 1 month ago

I used code as below to generate TSS annotaion, ‘dataset = pbm.Dataset(name='mmusculus_gene_ensembl', host='http://www.ensembl.org') annot = dataset.query(attributes=['chromosome_name', 'transcription_start_site', 'strand', 'external_gene_name', 'transcript_biotype']) annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].to_numpy(dtype = str) filter = annot['Chromosome/scaffold name'].str.contains('CHR|GL|JH|MT') annot = annot[~filter] annot['Chromosome/scaffold name'] = annot['Chromosome/scaffold name'].str.replace(r'(\b\S)', r'chr\1') annot.columns=['Chromosome', 'Start', 'Strand', 'Gene', 'Transcript_type'] annot = annot[annot.Transcript_type == 'protein_coding'] annot.to_csv('./scenicplus/mouse_embryo/rawdata/ref/tssannotation.bed', sep='\t', index=False)’ and fragment file is mm10 assembly version, pipeline_name=cellranger-arc pipeline_version=cellranger-arc-2.0.1 reference_path=/bi/apps/cellranger-arc/references/refdata-cellranger-arc-mm10-2020-A-2.0.0 Is there a version issue with this?ArchR result is is provided by the article of the data source.

ghuls commented 1 month ago

If you use the lastest ensembl gene annotation for mouse from Ensembl you have it for mm39 (GRCm39) and not mm10 (GRCm38):

$ pycistopic tss gene_annotation_list | grep -i Mouse
mspicilegus_gene_ensembl    Steppe mouse genes (MUSP714)
mspretus_gene_ensembl   Algerian mouse genes (SPRET_EiJ_v1)
mmusculus_gene_ensembl  Mouse genes (GRCm39)
mpahari_gene_ensembl    Shrew mouse genes (PAHARI_EIJ_v1.1)
mmurinus_gene_ensembl   Mouse Lemur genes (Mmur_3.0)
mcaroli_gene_ensembl    Ryukyu mouse genes (CAROLI_EIJ_v1.1)
pmbairdii_gene_ensembl  Northern American deer mouse genes (HU_Pman_2.1)

Help for pycistopic tss gene_annotation_list shows you the link to archived releases: https://www.ensembl.org/info/website/archives/index.html

pycistopic tss gene_annotation_list -h
usage: pycistopic tss gene_annotation_list [-h] [-f FILTER] [-s BIOMART_HOST] [--no-cache]

options:
  -h, --help            show this help message and exit
  -f FILTER, --filter FILTER
                        Only keep list of Ensembl BioMart gene annotation names that contain specified string.

Ensembl BioMart:
  Ensembl BioMart server settings.

  -s BIOMART_HOST, --server BIOMART_HOST
                        BioMart host URL to use. Default: "http://www.ensembl.org". Archived Ensembl BioMart URLs: https://www.ensembl.org/info/website/archives/index.html (List of currently available archives).
  --no-cache            Disable caching of requests to Ensembl BioMart server.

Ensembl release 102 (Noverber 2020) is the lastest release with mm10 (GRCm38): http://nov2020.archive.ensembl.org/index.html

Use that archive subdomain as server name and as you can see for mmusculus_gene_ensembl you now get GRCm38:

$ pycistopic tss gene_annotation_list --filter mouse --server http://nov2020.archive.ensembl.org/
mspicilegus_gene_ensembl    Steppe mouse genes (MUSP714)
mmurinus_gene_ensembl   Mouse Lemur genes (Mmur_3.0)
pmbairdii_gene_ensembl  Northern American deer mouse genes (HU_Pman_2.1)
mpahari_gene_ensembl    Shrew mouse genes (PAHARI_EIJ_v1.1)
mmusculus_gene_ensembl  Mouse genes (GRCm38.p6)
mcaroli_gene_ensembl    Ryukyu mouse genes (CAROLI_EIJ_v1.1)
mspretus_gene_ensembl   Algerian mouse genes (SPRET_EiJ_v1)

Create mm10 TSS BED file with UCSC chromosome names (starting with chr):

$ pycistopic tss get_tss --name mmusculus_gene_ensembl --server http://nov2020.archive.ensembl.org/ --ucsc mm10 --to-chrom-source ucsc --output mm10.tss.bed
- Get TSS annotation from Ensembl BioMart with the following settings:
  - biomart_name: "mmusculus_gene_ensembl"
  - biomart_host: "http://nov2020.archive.ensembl.org/"
  - transcript_type: ['protein_coding']
  - use_cache: True
- Getting chromosome sizes and alias mapping for "mm10" from UCSC.
- Update chromosome names in TSS annotation to "ucsc" chromosome names.
- Writing TSS annotation BED file to "mm10.tss.bed".
GGboy-Zzz commented 1 month ago

Thanks for your Thank you for your detailed response. I've realized that my TSS version is not consistent, and I will go ahead and modify this step. Is the pycistopic tss command available in version 2.0? It seems that the 1.0 version I previously installed does not support this command. Please forgive me as I am a beginner in using the shell.

ghuls commented 1 month ago

The git version has it. Also the QC calculation is now in the command line:

https://pycistopic.readthedocs.io/en/latest/notebooks/human_cerebellum.html#QC