aertslab / create_cisTarget_databases

Create cisTarget databases

How to deal with genes or regions in human unmapped in other species by liftover #25

Open yangliu0729 opened 1 year ago

yangliu0729 commented 1 year ago

Hi, I have a problem when using create_cross_species_motifs_rankings_db.py to get the ranking file.

Before this step:
(1) I selected the ~20,000 human genes that are in hg19-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather.
(2) I downloaded FASTA files for 7 species from UCSC: bosTau4 (Bos taurus), canFam2 (Canis familiaris), mm9 (Mus musculus), monDom5 (Monodelphis domestica), panTro2 (Pan troglodytes), rheMac2 (Macaca mulatta), and rn4 (Rattus norvegicus).
(3) I extracted the 500 bp upstream of the 20,000 human genes and obtained the orthologous regions in the 7 species with liftOver, using default parameters.
(4) Using create_cistarget_motif_databases.py, I generated 8 feather files as input for create_cross_species_motifs_rankings_db.py.

But something went wrong: AssertionError: Feather rankings database "test.monDom5.motifs_vs_regions.rankings.feather" contains different region or gene IDs or in a different order than in "test.mm9.motifs_vs_regions.rankings.feather". So I unified the gene IDs and their order across the files. In this step, for genes whose orthologous regions could not be found by liftOver, I replaced the sequence with NNNNNNNNNNNNN. (I don't know if this is the right thing to do.)
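The mismatch behind this assertion can be checked before scoring by listing the human region IDs that have no counterpart in a lifted-over FASTA file. The following is a minimal sketch using only the Python standard library. The function names are hypothetical and not part of create_cisTarget_databases, and it assumes the region ID is the first whitespace-delimited token of each FASTA header.

```python
def read_fasta_ids(fasta_path):
    """Return the sequence IDs of a FASTA file, in file order."""
    ids = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # Assumption: the ID is the first token of the header line.
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def missing_region_ids(human_fasta, lifted_fasta):
    """Return IDs present in the human FASTA but absent from the lifted-over one.

    The result keeps the order in which the IDs appear in the human FASTA.
    """
    lifted = set(read_fasta_ids(lifted_fasta))
    return [rid for rid in read_fasta_ids(human_fasta) if rid not in lifted]
```

Calling `missing_region_ids("human.fa", "monDom5.fa")` (hypothetical filenames) would return the regions that liftOver failed to map for that species.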

Then I updated the input files for create_cross_species_motifs_rankings_db.py, and it ran successfully. However, when I compared the same motif, my final result differed from the one in hg19-500bp-upstream-7species.mc9nr.genes_vs_motifs.rankings.feather. (I only compared the genes ranked 0, 1, 2, 3, 4, and 5, and they were completely different.)
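A comparison of ranks 0 to 5 only looks at a handful of genes out of ~20,000, so it may help to measure the overlap of the top N genes for a larger N as well. Below is a minimal sketch of such a comparison. It assumes the rankings for one motif have already been loaded into plain dictionaries mapping gene ID to 0-based rank; the function names and the toy data are hypothetical.

```python
def top_n(rankings, n):
    """Return the set of gene IDs whose 0-based rank is below n."""
    return {gene for gene, rank in rankings.items() if rank < n}


def top_n_overlap(rankings_a, rankings_b, n):
    """Fraction of the top-n genes of rankings_a that are also top-n in rankings_b."""
    top_a = top_n(rankings_a, n)
    top_b = top_n(rankings_b, n)
    if not top_a:
        return 0.0
    return len(top_a & top_b) / len(top_a)


# Toy example: two rankings of the same four genes for one motif.
own_db = {"g1": 0, "g2": 1, "g3": 2, "g4": 3}
reference_db = {"g1": 1, "g2": 0, "g3": 3, "g4": 2}

for n in (1, 2, 3):
    print(n, top_n_overlap(own_db, reference_db, n))
```

In this toy data the single top gene differs between the two rankings, while the top two genes are the same set, which shows how a very small N can overstate the disagreement.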

Could you give me some suggestions? Thank you for your help.

Best, LiuYang

ghuls commented 1 year ago

When scoring the motifs for each species for which you lifted over regions from human, pass --fasta-original-species ORIGINAL_SPECIES_FASTA_FILENAME as an extra parameter (the FASTA file with the regions in human) so that all regions are retained.
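As a rough illustration of this suggestion, the sketch below builds one scoring command per lifted-over species. Only --fasta-original-species is taken from the reply above. The other options (--fasta, --motifs_dir, --motifs, --output) are assumptions about the script's interface and should be checked against `create_cistarget_motif_databases.py --help`. All filenames and the helper function name are hypothetical.

```python
import shlex


def build_scoring_command(species_fasta, human_fasta, motifs_dir, motifs_list, db_prefix):
    """Build the argument list for scoring one lifted-over species.

    Only --fasta-original-species comes from the thread; the remaining
    option names are assumptions to verify with the script's --help output.
    """
    return [
        "create_cistarget_motif_databases.py",
        "--fasta", species_fasta,
        "--fasta-original-species", human_fasta,
        "--motifs_dir", motifs_dir,
        "--motifs", motifs_list,
        "--output", db_prefix,
    ]


# Hypothetical filenames, for illustration only.
cmd = build_scoring_command(
    species_fasta="test.monDom5.regions.fa",
    human_fasta="test.hg19.regions.fa",
    motifs_dir="motifs_cb",
    motifs_list="motifs.lst",
    db_prefix="test.monDom5",
)
print(shlex.join(cmd))
```

The same human FASTA file is passed for every species, so each resulting database should contain the same region IDs.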