conJUSTover / pSONIC

The repository serves as a public and official hosting of the pSONIC program (Conover et al., 2021).
GNU General Public License v3.0
19 stars 3 forks source link

gff issues #8

Open conJUSTover opened 2 years ago

conJUSTover commented 2 years ago

Hello,

Thank you for developing such a good tool. I'm trying to use pSONIC for inferring more one-to-one orthologs between Arabidopsis_thaliana and Gossypium_hirsutum. But I encountered many errors and not sure if i've every steps correctly. I hope you can give me some help.

So, the orignal gff file i used is GCF_007990345.1_Gossypium_hirsutum_v2.1_genomic.gff and TAIR10_GFF3_genes.gff. Based on my understanding, I first extracted the first colume (e.g. Chr1 in Ara tha), fourth colume (start_position), fifth colume (end_p), and gene ID (e.g. ID=AT1G01010.1) for each gff. I then made some modifications for chromosome name and gene id and concate them together. I then tried to use pSONIC translate function but it did work. So i have to mannully extracted gff as required (of couse, i'm not 100% sure about the accuracy of the gff but have done the best), and finally I believed i have obtained the gff file as required. Then, running MSCANX was fine, and i got the .tandom and .collinearity files. Then, i ran pSONIC but it kept saying errors like:

`start runing pSONIC Starting pSONIC Starting Singleton Search Initial filtering done: 1.4827887694040933 Tether Sets from OrthoFinder All Found: 1.6325551946957906 multiprocessing.pool.RemoteTraceback: """ Traceback (most recent call last): File "/home/chengyu/miniconda3/envs/psonic/lib/python3.7/multiprocessing/pool.py", line 121, in worker result = (True, func(*args, *kwds)) File "/home/chengyu/miniconda3/envs/psonic/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar return list(map(args)) File "pSONIC.py", line 321, in add_names edge.append(gene_list.index(edge[0])) ValueError: '1_6800' is not in list """

The above exception was the direct cause of the following exception:

Traceback (most recent call last): File "pSONIC.py", line 444, in parse_args() File "pSONIC.py", line 442, in parse_args main(args.prefix, args.orthogroups, args.threads, args.ploidy, args.sequenceIDs, args.speciesIDs) File "pSONIC.py", line 382, in main singletons = edges_to_groups(singletons, gene_names, threads) File "pSONIC.py", line 75, in edges_to_groups edge_list = pool.map(add_names_part, edge_list) File "/home/chengyu/miniconda3/envs/psonic/lib/python3.7/multiprocessing/pool.py", line 268, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/home/chengyu/miniconda3/envs/psonic/lib/python3.7/multiprocessing/pool.py", line 657, in get raise self._value ValueError: '1_6800' is not in list`

I've attached the .gff and SequenceID.txt files that I used. If you could point out where is wrong and give some help, that would be much appreciated. thank you! Gh_At.zip SequenceIDs.txt

Best regards

Originally posted by @henryeahhh in https://github.com/conJUSTover/pSONIC/issues/7#issuecomment-1099883895

conJUSTover commented 2 years ago

Hi @henryeahhh, I created a new issue to address this specifically, in hopes to keep each potential fix to a problem in its own separate thread.

From best I can tell, you downloaded this G. hirsutum gff and protein fasta files from NCBI, and their own annotation pipelines are notoriously inconsistent in regards to naming genes across different file types. For instance, the gene ID that threw the flag is not found in the gff you provided because that gene name (XP_016670680.2) is only found in the CDS entries of the gff, not in the main gene or mRNA entries. And the transcript name for that same gene (exon entries in the gff file) are different still.

I would recommend two potential ways forward: 1.) download the same reference genome from cottongen, which is a database specifically for cotton genome sequences. Cottongen is really good about holding consistent naming across genes and protein sequences. Protein fasta is found here and gff can be found here. You will have to redo the file formatting the same way you did before, but this would be the easier approach. 2.) Download the genome fasta sequence from NCBI, and pull out the transcripts and convert them to protein sequences before running OrthoFinder. If you do this, ensure that when you pull out the gene name from column 9 of the gff, you pull the correct gene name tag (I think it should be the transcript_id= tag instead of the ID= tag). This may be the more difficult option of the two, but if you pipeline already is dependent on using this annotation of the reference genome, then this is probably the easiest way forward.

Another thing that I noticed in your gff file is that for the Arabidopsis protein sequences, you did not filter out any alternatively spliced isoforms (e.g. gene names ending in anything other than ".1"). This will create artificially high numbers of genes in each ortholog group because pSONIC and MCscanX treat each isoform as a separate gene. I would remove these non-primary isoforms from both the gff and protein fasta files before you rerun pSONIC with the updated cotton annotations. I would also remove the genes that are encoded in the mitochondria and plastid genomes, which are also present in your gff file (chromosomes AT0Mt and AT0Pt) unless they are intentionally left in for use in downstream analyses.

Please let me know if you come across any other problems, or if I need to clarify anything above. I'm happy to help!

Thanks, Justin

conJUSTover commented 2 years ago

Hi @henryeahhh, I actually found a workaround for the cotton dataset that works, and I'll try to walk through what this one-liner does. Here's the command to run on the genome.gff file from NCBI:

awk 'BEGIN{FS=OFS="\t"} {if($3 == "CDS") {print $1,$4,$5,$9}}' genomic.gff | sed 's/ID=.*gene=//g' | sed 's/;.*protein_id=/\t/g' | sed 's/;.*//g' | awk '{print $1,$2,$3,$5,$4}' | sort -k4,4 -k2,2n | uniq -f 3 | sort -k5,5 -k4,4r | uniq -f 4 | awk 'BEGIN{OFS="\t"} {print $1,$4,$2,$3}' | grep "^NC_05" | sort -k2,2 > genome.JGI.rawCHR.gff

In short, this command takes the original gff file (named genomic.gff above) and pulls out all of the CDS entries, and pulls out the chromosome number, start site, end site, and description. Then it edits the description to find the protein ID and the gene ID. The rest of the scripts reduces the multiple CDS entries for each gene into one entry per protein, then one entry per gene, and removing any genes from non-chromosomal scaffolds, mitochondria, or plastids.

You will still need to edit the chromosome IDs to fit the MCScanX requirements, and edit the fasta file to only keep those in the second column of the new gff file.

Please let me know if you have any questions about this, or if you need me to explain in more detail what each step of the command above does. Justin

chengyuye commented 2 years ago

Thank you so much, @conJUSTover ! Really appreciate your careful instructions! Just one question, after editing the .gff file, should i edit the protein fasta file for keeping the those in the new .gff file and re-run orthofinder, or can i use the original orthofinder results but edit the .blast, Orthogroups.tsv, and SequenceIDs.txt files to keep those only in the new .gff file? Anyway, hope this method can work! I'll try it on Monday morning.

BTW, the reason I have to use this NCBI gff and genome fasta is because we are doing a single-cell cotton project and we found this version has better alignment. I hope to identify more one-to-one orthologs between cotton and Ara using pSONIC because orthofinder only indicates around 1800 one-to-one orthologs, which is generally not sufficient for cross-species comparision.

Thank you again for developing this tool, and really appreciate your instructions! Will keep you updated

conJUSTover commented 2 years ago

Yes, I would rerun OrthoFinder again since many of the genes from the previous run won't be in your new gff file.

Totally understandable with the annotation version. My guess is that pSONIC may give similar numbers of one-to-one orthologs because of the extensive polyploid histories in both Gossypium (5-6x multiplication at the base of Malvaceae family, and 2x more recently) and Arabidopsis (two separate 2x events throughout the Brassicales phylogeny). So, we would expect every gene to be present in multiple copies in both species unless there have similar and consistent patterns of fractionation and gene loss in both clades. I would be sure to include this information in the optional (but highly recommended in your case) .ploidy file when you run pSONIC.

Please let me know how this turns out, and I'd be excited to hear how it compares to Orthofinder in this context!

chengyuye commented 2 years ago

Hi Justin! Thank you very much for your reply. Based on your suggestions, I got my new .gff file and extracted corresponding protein sequences, then re-run the orthofinder this morning. But when finally running pSONIC, I still got the error messages like following:

Starting pSONIC Starting Singleton Search Traceback (most recent call last): File "pSONIC.py", line 444, in <module> parse_args() File "pSONIC.py", line 442, in parse_args main(args.prefix, args.orthogroups, args.threads, args.ploidy, args.sequenceIDs, args.speciesIDs) File "pSONIC.py", line 379, in main species_nums, singletons = SingleCopy_from_Orthofinder(orthogroups, tandem_network, tandem_pairs, ploidies, gene_code, species_dict) File "pSONIC.py", line 47, in SingleCopy_from_Orthofinder genes = [gene_code[gene] for gene in genes] File "pSONIC.py", line 47, in <listcomp> genes = [gene_code[gene] for gene in genes] KeyError: 'XP_016666881.1'

I'm pretty sure this time all the files should be correct? Would you mind having a look? I've now attached every files I used in case you want to do some tests. Thank you very much for your help! https://drive.google.com/file/d/1kCCuDSmze1AUbnFlP9qBxhV3eYDqXWqH/view?usp=sharing

ssamberkar commented 2 years ago

Thank you so much, @conJUSTover ! Really appreciate your careful instructions! Just one question, after editing the .gff file, should i edit the protein fasta file for keeping the those in the new .gff file and re-run orthofinder, or can i use the original orthofinder results but edit the .blast, Orthogroups.tsv, and SequenceIDs.txt files to keep those only in the new .gff file? Anyway, hope this method can work! I'll try it on Monday morning.

BTW, the reason I have to use this NCBI gff and genome fasta is because we are doing a single-cell cotton project and we found this version has better alignment. I hope to identify more one-to-one orthologs between cotton and Ara using pSONIC because orthofinder only indicates around 1800 one-to-one orthologs, which is generally not sufficient for cross-species comparision.

Thank you again for developing this tool, and really appreciate your instructions! Will keep you updated

From my experience, so long as the merged GFF contains all the transcripts/peptides from the individual fasta files, you shouldn't have any issues.

conJUSTover commented 2 years ago

Hi @henryeahhh, I'm so sorry for replying so late. The end of the semester was kind of crazy for me, but I have some more time now.

I'm having trouble downloading the files from the folder you linked to you - Could you try attaching the SequencIDs.txt and the .collinearity file directly here? I can take a look at it as soon as I can.

conJUSTover commented 2 years ago

Actually, I just got it. For some reason, the link the issue doesn't work, but copy/pasting the link into a window works just fine. I'll work on it now, with a hopeful answer soon

conJUSTover commented 2 years ago

So it appears that MCScanX isn't including the Arabidopsis genes in it's final .collinearity file, but it's not throwing an error. I'm not sure what the problem here is. Could you send me the raw protein fasta files you used, and I can try to recreate the error from the beginning to see where it's coming from?

chengyuye commented 2 years ago

So it appears that MCScanX isn't including the Arabidopsis genes in it's final .collinearity file, but it's not throwing an error. I'm not sure what the problem here is. Could you send me the raw protein fasta files you used, and I can try to recreate the error from the beginning to see where it's coming from?

Thank you very much for your reply! Totally understandable and hope this issue won't take you too much time. Here are the protein files that I used (Gh_protein.fa has been modified following your previous instructions). https://drive.google.com/drive/folders/1XZE9gdIKTBYMvVSh6ukYFkL4FLNcfXfF?usp=sharing

Again, really appreciated it!