churchmanlab / genewalk

GeneWalk identifies relevant gene functions for a biological context using network representation learning
https://churchman.med.harvard.edu/genewalk
BSD 2-Clause "Simplified" License

Id_type for mouse genes #1

Closed whelena closed 4 years ago

whelena commented 4 years ago

I have a text file of mouse genes with their MGI IDs (e.g. MGI:894679). However, when I ran genewalk I received errors like: genewalk.gene_lists - Could not get HGNC ID for MGI ID, although the code kept running. Is this an issue, and if so, should I convert the gene names into HGNC IDs instead?

bgyori commented 4 years ago

Hi @hkanya, unless there is a systematic problem with the format of your gene list file, it's likely that a subset of MGI IDs couldn't be mapped to an HGNC ID. GeneWalk will print a warning for each such case but otherwise proceed with the analysis. The example you posted (MGI:894679) looks mappable to me. Do you have specific examples of IDs that couldn't be mapped and for which you got a warning?

ri23 commented 4 years ago

Some expected numbers to compare your run with: in our study of mouse RNA-seq, no human HGNC IDs were found for ~5% of our 1861 input MGI IDs (see the bioRxiv paper for details). Hope this helps you interpret your results.

whelena commented 4 years ago

I used 851 mouse genes and I'm not sure how many of them were unmappable, but the genewalk_result.csv file is empty.

The genewalk_all.log looks something like this:

WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:3644854"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:1926387"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:1920198"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:2141599"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:1916394"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:1916672"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:1923724"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:103582"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:1354731"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:2449316"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:2444923"
WARNING: [2019-09-11 10:33:06] genewalk.gene_lists - Could not get HGNC ID for MGI ID "MGI:1098583"

ri23 commented 4 years ago

It seems your input file contains the gene IDs wrapped in quotation marks, for example "MGI:3644854". I checked our code, and that is currently the reason why none of your genes map. For now, I would advise you to use a list without the quotation marks and try again. If you are exporting the gene list from R, you can set quote=FALSE to do this, for example: write.table(mgi_list, file = file_name, row.names=FALSE, col.names=FALSE, quote=FALSE).
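If the list was not produced in R, the same cleanup can be done on the file itself. The following is a minimal Python sketch, not part of GeneWalk: the helper names strip_quotes and clean_gene_list and the file names are hypothetical, and it assumes one gene ID per line.

```python
def strip_quotes(lines):
    """Return gene IDs with surrounding whitespace and quote characters removed.

    Blank lines are dropped. Assumes one gene ID per line.
    """
    cleaned = []
    for line in lines:
        # Remove the newline and padding first, then any wrapping quotes.
        gene_id = line.strip().strip("\"'")
        if gene_id:
            cleaned.append(gene_id)
    return cleaned


def clean_gene_list(in_path, out_path):
    """Read a quoted gene list and write an unquoted copy (hypothetical helper)."""
    with open(in_path) as handle:
        ids = strip_quotes(handle)
    with open(out_path, "w") as handle:
        handle.write("\n".join(ids) + "\n")
    return ids
```

For example, clean_gene_list("gene_list_quoted.txt", "gene_list.txt") would write a file that can then be passed to genewalk via --genes.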

whelena commented 4 years ago

I tried that and I am still receiving the same error. Do I need to convert the list into human genes in R before using it with genewalk?

error:

WARNING: [2019-09-12 10:53:13] genewalk.gene_lists - Could not get HGNC ID for MGI ID 1921920
WARNING: [2019-09-12 10:53:13] genewalk.gene_lists - Could not get HGNC ID for MGI ID 3643362
WARNING: [2019-09-12 10:53:13] genewalk.gene_lists - Could not get HGNC ID for MGI ID 3643618
WARNING: [2019-09-12 10:53:13] genewalk.gene_lists - Could not get HGNC ID for MGI ID 3781689
WARNING: [2019-09-12 10:53:13] genewalk.gene_lists - Could not get HGNC ID for MGI ID 3649964
WARNING: [2019-09-12 10:53:13] genewalk.gene_lists - Could not get HGNC ID for MGI ID 3642245

ri23 commented 4 years ago

The output you are getting now is as expected. Note that many of the IDs that previously failed are no longer in your log file (e.g. MGI:3644854), so they are properly mapped now. The remaining ones are mostly uncharacterized mouse genes, such as pseudogenes or predicted genes (e.g. http://www.informatics.jax.org/marker/MGI:3643362), for which unfortunately no reliable mapping to a human ortholog can be made. See my message from yesterday: about 5% of our mouse genes were not mapped for that reason. In your case (851 input genes), that would be about 40-50 genes with a warning in your current log file. Can you check whether that is about right? Also, if you have a computer with 4 cores or access to a cluster, you can reduce the GeneWalk run time to 2-3 hours and see your results file more quickly (see the readme page for details): genewalk --project context1 --genes gene_list.txt --id_type mgi_id --nproc 4.
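One way to check the number of warnings against the ~5% figure is to count the distinct unmapped IDs in the log. This is a rough Python sketch, not a GeneWalk feature: it assumes the warning wording shown in this thread, and the function names are hypothetical.

```python
import re

# Matches the warning text quoted in this thread; the ID may or may not be
# wrapped in double quotes. The exact wording is assumed from those log lines.
WARNING_RE = re.compile(r'Could not get HGNC ID for MGI ID\s+"?([^"\s]+)"?')


def unmapped_ids(log_text):
    """Return the sorted, de-duplicated IDs that triggered a mapping warning."""
    return sorted(set(WARNING_RE.findall(log_text)))


def unmapped_fraction(log_text, n_input_genes):
    """Fraction of input genes that could not be mapped to an HGNC ID."""
    return len(unmapped_ids(log_text)) / n_input_genes
```

With the log read into a string (for instance open("genewalk_all.log").read()), unmapped_fraction(log_text, 851) gives a value to compare with roughly 0.05. Repeated warnings for the same ID are counted once.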

whelena commented 4 years ago

Okay, I'll see once the run is over. Thank you so much for your help!

bgyori commented 4 years ago

Hi @hkanya, let us know if you are still having issues; otherwise I'll close this thread.

whelena commented 4 years ago

It's working now, thank you!