Do not know which reference should be chosen for doing BLAST

atarashansky / SAMap

SAMap: Mapping single-cell RNA sequencing datasets from evolutionarily distant organisms.

MIT License

64 stars, 19 forks

Do not know which reference should be chosen for doing BLAST #101

Closed houruiyan closed 1 year ago

houruiyan commented 1 year ago

Hi, thank you for your reply. Actually, I still do not understand the answer to my last question. I have a single-cell h5ad file, and the gene var index looks like this. But I do not know which reference genome I should select to run BLAST.

Could you tell me which reference sequence I should select so that I do not need to do any combining? http://asia.ensembl.org/info/data/ftp/index.html/

Thank you! The question in https://github.com/atarashansky/SAMap/issues/97 was not resolved, because even if I drop the .1, .2 suffixes, the IDs still do not match. For example, for human, the gene ID starts with ENSG, while the isoform ID starts with ENST.

Hope to get your answer. Thank you very much!

Ruiyan

atarashansky commented 1 year ago

Ah, got it. You want CDS (FASTA), most likely.

houruiyan commented 1 year ago

Hello Alec, I found that the CDS (FASTA) file also contains sequences at the transcript level rather than the gene level. Is it reliable to simply average the values of the different transcripts of one gene, using the conversion table between gene_id and transcript_id?

atarashansky commented 1 year ago

This is handled by the names parameter of the SAMAP class.

It looks like: names = {'mo': mo_mapping, 'hu': hu_mapping, ... }

mo_mapping can be a list of tuples mapping each mo FASTA header to its corresponding gene ID in your mo dataset, and likewise for all your other species: [(fasta_id1, gene_id1), (fasta_id2, gene_id2), ...]

That should be exactly what you need.
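
A minimal sketch of how such a list of tuples could be built from an Ensembl CDS FASTA file. It assumes each header starts with the transcript ID and contains a gene:<ID> field, and that the BLAST output refers to each sequence by the first whitespace-delimited token of its header. The function names and the example IDs are hypothetical. Whether version suffixes (.1, .2) should be stripped depends on how the IDs appear in your own BLAST output and dataset.

```python
import re

# Matches the "gene:<ID>" field of an Ensembl-style header,
# but not "gene_biotype:" or "gene_symbol:".
GENE_FIELD = re.compile(r"(?:^|\s)gene:(\S+)")


def parse_header(header, strip_gene_version=True):
    """Return (fasta_id, gene_id) for one FASTA header line, or None."""
    fields = header.lstrip(">").split()
    if not fields:
        return None
    fasta_id = fields[0]  # assumed to be the ID that appears in the BLAST output
    match = GENE_FIELD.search(header)
    if match is None:
        return None  # header carries no gene annotation
    gene_id = match.group(1)
    if strip_gene_version:
        gene_id = gene_id.split(".")[0]  # e.g. "ENSG...1.2" -> "ENSG...1"
    return fasta_id, gene_id


def build_mapping(lines, strip_gene_version=True):
    """Collect (fasta_id, gene_id) pairs from an iterable of FASTA lines."""
    pairs = []
    for line in lines:
        if line.startswith(">"):
            pair = parse_header(line.strip(), strip_gene_version)
            if pair is not None:
                pairs.append(pair)
    return pairs


# Made-up headers in the Ensembl CDS style, for illustration only.
example_lines = [
    ">ENST00000000001.1 cds chromosome:GRCh38:1:100:200:1 gene:ENSG00000000001.2 gene_biotype:protein_coding",
    "ATGGCC",
    ">ENST00000000002.3 cds chromosome:GRCh38:1:100:300:1 gene:ENSG00000000001.2 gene_biotype:protein_coding",
    "ATGAAA",
]

hu_mapping = build_mapping(example_lines)
names = {"hu": hu_mapping}  # same shape as the names dictionary described above
```

For a real file, an open file handle can be passed directly to build_mapping, since it only iterates over lines. Several transcripts may map to the same gene ID, as in the example.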

atarashansky commented 1 year ago

Closing for now, please reopen if you still have questions!

LalicJ commented 1 year ago

@atarashansky Sorry to bother you! I ran into the same problem! The gene names in my file start with ENSMFAG, but the BLAST results start with ENSMFAT. Where and how should I modify them? Should I change the BLAST file or adata.var? I'd appreciate any help getting past this! Hope to get your answer. Thank you very much!

atarashansky commented 1 year ago

Check out my comment above yours! That parameter is what you need.

LalicJ commented 1 year ago

Yes, I understand your answer above, but I would like to ask whether there is a faster way to match them one by one, because I can't think of a good way to match them up (poor coding ability...). I'm also not sure whether the one-to-one array should contain only the gene names in my dataset, or whether I need all the gene mappings. Also, I noticed that not all transcript names (fasta_id) have corresponding gene names (gene symbol). Sorry to bother you again. I'd appreciate any help getting past this! Thanks!
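
One way to check how well a mapping covers a dataset is sketched below. The thread does not say whether SAMap needs the mapping restricted to dataset genes, so this is only a diagnostic under that caveat. The function name and the IDs are hypothetical, and dataset_genes stands in for the gene names of your own dataset (for example, the var index of the h5ad file).

```python
def split_by_dataset(pairs, dataset_genes):
    """Split (fasta_id, gene_id) pairs by whether gene_id is in the dataset.

    Returns the pairs whose gene is present in the dataset, and a sorted
    list of dataset genes that no FASTA header maps to.
    """
    dataset_genes = set(dataset_genes)
    matched = [(fasta_id, gene_id) for fasta_id, gene_id in pairs
               if gene_id in dataset_genes]
    covered = {gene_id for _, gene_id in matched}
    missing = sorted(dataset_genes - covered)
    return matched, missing


# Made-up IDs in the ENSMFAT / ENSMFAG style, for illustration only.
pairs = [
    ("ENSMFAT00000000001.1", "ENSMFAG00000000001"),
    ("ENSMFAT00000000002.1", "ENSMFAG00000000001"),
    ("ENSMFAT00000000003.1", "ENSMFAG00000000009"),
]
dataset_genes = ["ENSMFAG00000000001", "ENSMFAG00000000002"]

matched, missing = split_by_dataset(pairs, dataset_genes)
```

Here matched keeps the two transcripts of the gene that is in the dataset, and missing lists the dataset gene with no FASTA header pointing to it.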