aertslab / create_cisTarget_databases

Create cisTarget databases
37 stars 8 forks source link

pig cisTarget database issues! #16

Open Samaneh14 opened 2 years ago

Samaneh14 commented 2 years ago

Hi,

Regarding the @tropfenameimer comment on the build of Gallus gallus cisTarget database issue #4, I tried to take similar steps for building pig cisTarget database.

First, I got the regulatory fasta file through Ensembl/BioMart Ensembl_biomart Then, I made the TF motifs (JASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txt) in clusterbuster format and run the create_cistarget_motif_databases.py by the inputs. However, the job stopped with Traceback (most recent call last): File "~/stg_000??/SCENIC/create_cisTarget_databases/create_cistarget_motif_databases.py", line 504, in <module> main() File "~/stg_000??/SCENIC/create_cisTarget_databases/create_cistarget_motif_databases.py", line 289, in main region_or_gene_ids = RegionOrGeneIDs.get_region_or_gene_ids_from_fasta( File "~/stg_000??/SCENIC/create_cisTarget_databases/cistarget_db.py", line 150, in get_region_or_gene_ids_from_fasta region_id = line[1:].split(maxsplit=1)[0] IndexError: list index out of range

I don't know the reason for the upper errors!!! Would you please help me to fix the issue?

and also the last error ValueError: Error: region ID "SEPTIN3|ENSSSCG00000000040" is not unique in FASTA file

I prepared the fasta file by both gene name and gene ensemble ID, because some genes are not annotated in pig and just have the ID. So, just to unify the gene names as the input for SCENIC in where Gene ID is considered when the gene name is absent, I used both gene name and gene ID. However, by this most of the headers have both of them. How can I remove part of the headers to have just gene name in case of both gene name and gene ID.

Thanks for your time and help.

Best regards Samaneh

ghuls commented 2 years ago

You need to add -g, --genes argument: -g '\|ENSSSCG[0-9]+$', but the original FASTA IDs still need to be unique, so probably you need to include the transcript ID too to get a unique FASTA ID. In that case, something like this might work (if pig Ensembl transcript IDs start with ENSSSCT):

-g '\|ENSSSCG[0-9]+\|ENSSSCT[0-9.]+$'

Samaneh14 commented 2 years ago

Thanks @ghuls for your very fast reply. I will try it soon.

Samaneh14 commented 2 years ago

Hi, Another question, My rnaseq data are from S.scrofa11.1.97 while through Ensembl/BioMart just dataset105 is available for fasta file. I wonder if liftover can be applicable in this case! Thanks in advance!

ghuls commented 2 years ago

You don't need liftOver as long as your expression matrix you generated for your RNAseq data has the same gene names as the ones you are generating now in the current FASTA file. Else you will need to convert gene symbols from the S.scrofa11.1.97 annotation to the Ensembl v105 version.

Samaneh14 commented 2 years ago

Hi, Not sure to discuss this issue here, but just as following this already open issue, please let me ask how to solve the challenge to translate the motif-to-TF annotations for pig, I usedJASPAR2022_CORE_vertebrates_non-redundant_pfms_jaspar.txtand motifs-v9-nr.hgnc-m0.001-o0.0.tbl, but running SCENIC with vsn stopped within the last step (single_sample_scenic:SCENIC_APPEND:scenic:CISTARGET__MOTIF), because "None of ['features'] are in the columns". Thanks in advance. Samaneh

ghuls commented 2 years ago

You need to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the pig gene symbols that are in your database.

Samaneh14 commented 2 years ago

They are partially the same (i.e, most of the human genes with the same symbol are found in pig). Should they be exactly replaced?

ghuls commented 2 years ago

Are the gene IDs in your database exactly the same as in human (also same casing)?

import cistarget_db

ctx_rankings_genes_vs_motifs_feather_file = 'your_file.*.feather'

gene_ids = cistarget_db.CisTargetDatabase.get_all_region_or_gene_ids_and_motif_or_track_ids_from_db(ctx_rankings_genes_vs_motifs_feather_file)[1]

print(gene_ids)

gene_ids.ids
Samaneh14 commented 2 years ago

Thanks for the hint. Some pig genes don't have orthologs in humans. Seems, I should back to the pig fasta file and just include the orthologs with human and redo the downstream steps.

ghuls commented 2 years ago

That is probably not the best idea, unless you also want to change your input expression matrix every time.

frucelee commented 2 years ago

You need to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the pig gene symbols that are in your database.

Thank you o much for the suggestion. I also face the problem. Do you know how to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the other species' gene symbols. Sorry for the stupid question. I tried several times but did not find a good solution. I will very very appreciate you if you can help me. Thanks in advance lee

ghuls commented 2 years ago

You can do that with Ensembl Biomart.

Example for human to mouse mapping:

# Map Ensembl gene IDs to HGNC or MGI symbols.

    # Filename: ensg_to_hgnc.ensembl${ensembl_version}.tsv
    #   * Dataset:
    #       - Ensembl Genes 105
    #       - Human genes (GRCh38.p13)
    #   * Attributes:
    #       - Gene stable ID
    #       - HGNC symbol
    #   * Unique results only
    #   * Dataset 68005 / 68005 Genes
    #   * URL: http://www.ensembl.org/biomart/martview/92afc86ff2bede11ba9cdae04563ea92?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.feature_page.ensembl_gene_id|hsapiens_gene_ensembl.default.feature_page.hgnc_symbol&FILTERS=&VISIBLEPANEL=resultspanel
    ENSG_TO_HGNC_FILENAME = os.path.join(GENE_CONVERSION_DIR, 'ensg_to_hgnc.ensembl105.tsv')

    # Filename: ensmusg2mgi_ensembl${ensembl_version}.tsv
    #   * Dataset:
    #       - Ensembl Genes 105
    #       - Mouse genes (GRCm39)
    #   * Attributes:
    #       - Gene stable ID
    #       - MGI symbol
    #   * Unique results only
    #   * Dataset 55414 / 55414 Genes
    #   * URL: http://www.ensembl.org/biomart/martview/92afc86ff2bede11ba9cdae04563ea92?VIRTUALSCHEMANAME=default&ATTRIBUTES=mmusculus_gene_ensembl.default.feature_page.ensembl_gene_id|mmusculus_gene_ensembl.default.feature_page.mgi_symbol&FILTERS=&VISIBLEPANEL=resultspanel
    ENSMUSG_TO_MGI_FILENAME = os.path.join(GENE_CONVERSION_DIR, 'ensmusg_to_mgi.ensembl105.tsv')

# Get Human Ensemble Gene ID to Mouse Ensemble Gene ID mapping and take results with confidence = 1.
    # Orthology Tables between species in Ensembl gene IDs:
    #   - Ensembl BioMart homepage:  http://www.ensembl.org/biomart/martview/
    #   - Version:                   105
    #   - Downloaded on:             2022/01/25
    #
    # Orthology Table files can be downloaded manually:
    #
    #   Filename: ensg_to_ensmusg.ensembl${ensembl_version}.tsv
    #     * Dataset:
    #         - Ensembl Genes 105
    #         - Human genes (GRCh38.p13)
    #     * Filters:
    #         - MULTI SPECIES COMPARISONS:
    #             * Homologue filters: Orthologous Mouse Genes
    #             * Only
    #     * Attributes:
    #         - Homologues
    #         - GENE: Ensembl:
    #             * Gene stable ID
    #         - ORTHOLOGUES: Mouse Orthologues:
    #             * Mouse gene stable ID
    #             * Mouse homology type
    #             * Last common ancestor with Mouse
    #             * %id. target Mouse gene identical to query gene
    #             * %id. query gene identical to target Mouse gene
    #             * Mouse orthology confidence [0 low, 1 high]
    #     * Unique results only
    #     * Dataset 19621 / 68005 Genes
    #     * URL: http://www.ensembl.org/biomart/martview/92afc86ff2bede11ba9cdae04563ea92?VIRTUALSCHEMANAME=default&ATTRIBUTES=hsapiens_gene_ensembl.default.homologs.ensembl_gene_id|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_ensembl_gene|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_orthology_type|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_subtype|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_perc_id|hsapiens_gene_ensembl.default.homologs.mmusculus_homolog_perc_id_r1&FILTERS=hsapiens_gene_ensembl.default.filters.with_mmusculus_homolog.only&VISIBLEPANEL=attributepanel
JoGraesslin commented 1 year ago

You need to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the pig gene symbols that are in your database.

Thank you o much for the suggestion. I also face the problem. Do you know how to convert the HGNC gene symbols in motifs-v9-nr.hgnc-m0.001-o0.0.tbl to the other species' gene symbols. Sorry for the stupid question. I tried several times but did not find a good solution. I will very very appreciate you if you can help me. Thanks in advance lee

Hello, did you succeed with your endeavor? I am not sure how ENSEMBL biomart can help us since the JASPAR DB contains different species and no ENSEMBL names. I am trying to set up the DB for scenic in Zebrafish. Did you find another solution? Would be highly grateful for an update. Best, Jo

mtrebelo commented 1 year ago

@JoGraesslin Hi, were you able to set up the database for zebrafish? I am trying to run scenic+ with a zebrafish dataset and I would appreciate the help a lot. Thanks in advance.

WhenFlyWang commented 3 weeks ago

Hi, did you completed pig cisTargetdb?