Planteome / planteome-annotation-data

This is a place to discuss issues around the Planteome annotation data and store useful scripts etc.

Systemic issue in GAF file formatting #28

Closed serenalotreck closed 2 years ago

serenalotreck commented 2 years ago

I'm trying to read in a large quantity of files from the SVN Repository using the GafReader tool from the goatools library, which requires files to be valid GAF files.

In the process of doing so, I've identified three recurring issues in the .assoc files from the repository. Each one violates the GAF format and prevents the files from being read with goatools' GafReader:

  1. Misformatted date: The date is formatted as MMDDYYYY instead of the required YYYYMMDD
  2. Invalid Aspect codes: The Aspect column contains a capital letter other than P, F or C
  3. Missing columns: Even though some of the columns are optional, they still need to be included as empty columns for the file to be parsed correctly, and several files are missing one or more columns. This issue on goatools' repo describes why all the fields need to be present, even if empty.
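The three checks above could be automated along the following lines. This is only a minimal sketch, not part of goatools or the Planteome tooling: the function name `check_gaf_line` and the `allowed_aspects` parameter are made up for illustration, and it assumes the 17 tab-separated columns of GAF 2.x, with Aspect in column 9 and Date in column 14.

```python
from datetime import date

GAF_COLUMNS = 17             # GAF 2.x defines 17 tab-separated columns
GO_ASPECTS = {"P", "F", "C"}  # aspect codes used by the Gene Ontology


def check_gaf_line(line, allowed_aspects=GO_ASPECTS):
    """Return a list of problems found in one GAF annotation line.

    Hypothetical helper for illustration only.
    """
    problems = []
    fields = line.rstrip("\n").split("\t")
    if len(fields) != GAF_COLUMNS:
        problems.append(f"expected {GAF_COLUMNS} columns, found {len(fields)}")
        return problems  # column positions are unreliable, so stop here

    aspect, stamp = fields[8], fields[13]
    if aspect not in allowed_aspects:
        problems.append(f"unexpected aspect code {aspect!r}")

    # A valid date is exactly 8 digits that form a real calendar date
    # when read as YYYYMMDD.
    valid_date = len(stamp) == 8 and stamp.isdigit()
    if valid_date:
        try:
            date(int(stamp[:4]), int(stamp[4:6]), int(stamp[6:]))
        except ValueError:
            valid_date = False
    if not valid_date:
        problems.append(f"date {stamp!r} is not YYYYMMDD")
    return problems
```

Passing a different set as `allowed_aspects` would let the same check run against files that use non-GO aspect codes.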

Given that I was only working with a preliminary subset of 45 files, I'm concerned that this issue is quite extensive. I can provide the names of specific files as needed, but as I'm hoping to incorporate most of the database into my pipeline, I suspect the list of problematic files will keep getting longer, and I'm not sure what the best way to deal with that is. If you could let me know how I can best help you fix the issue, that would be great!

cooperl09 commented 2 years ago

Hi Serena, I'm glad you found your way to this repository. We will review the issues listed above, but it would be helpful if you could list the specific files and the issues you are having with them. We are not aware of any files with missing columns. As for the aspect column, we annotate the data to our own ontologies (A, T, E), so those aspect symbols are different from the GO ones (P, F, C). Thus the GOA tools may not be appropriate for these files. You can see the annotations to the different ontologies and the aspects here: https://browser.planteome.org/amigo/search/annotation

serenalotreck commented 2 years ago

Thanks so much for your quick response! The files are below:

Files with incorrectly formatted dates:

to_germplasm_O.sativa_IRRI_GRIMS_C-traits.assoc
to_germplasm_O.sativa_IRRI_GRIMS_L-traits.assoc
to_germplasm_O.sativa_IRRI3K.assoc
to_germplasm_O.sativa_IRRI_GRIMS_PR-traits.assoc
to_germplasm_O.sativa_IRRI_GRIMS_S-traits.assoc
to_germplasm_O.sativa_IRRI_GRIMS_FG-traits.assoc
to_germplasm_Sorghum_GRIN.assoc
to_germplasm_O.sativa_IRRI_GRIMS_B-traits.assoc
to_germplasm_rice_GRIN.assoc
to_germplasm_O.sativa_IRRI_GRIMS_A-traits.assoc
to_germplasm_maize_GRIN.assoc

Files with missing columns:

go_ortholog_gene_Sorghum_bicolor.assoc
go_iprscan_Sorghum_bicolor.assoc

While I was looking for these I also realized that one of the files I thought had too few columns, actually had too many: go_gene_Oryza_Gramene.assoc

serenalotreck commented 2 years ago

I also just got a reply from @elserj via email, in which he pointed out that anything not in the go-associations, po-associations, etc. directories is a test/development file. None of the files I listed come from outside (or deeper than) those directories, so none of them should be test or development files. Specifics on which files I downloaded are in this yaml file

serenalotreck commented 2 years ago

> In terms of the aspect column, we annotate the data to our own ontologies (A, T, E) so those aspect symbols are different than the GO (P, F, C). Thus the GOA tools may not be appropriate for these files. You can see the annotations to the different ontologies and the aspects here: https://browser.planteome.org/amigo/search/annotation

With respect to this, is there an automated tool you would suggest using for parsing the GAF files, or should I just plan on writing my own?
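For reference, writing one's own reader would not need to be elaborate. The sketch below is one possible shape, not an existing tool: `read_gaf` and `GAF_FIELDS` are hypothetical names, the field labels follow the GAF 2.x column order, and lines starting with `!` are treated as comments. It does not validate aspect codes at all, so A, T and E pass through unchanged.

```python
import io

# Column labels in GAF 2.x order (labels chosen for this sketch).
GAF_FIELDS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "Qualifier", "Term_ID",
    "DB_Reference", "Evidence_Code", "With_or_From", "Aspect",
    "DB_Object_Name", "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Date", "Assigned_By", "Annotation_Extension", "Gene_Product_Form_ID",
]


def read_gaf(handle):
    """Yield one dict per annotation line of an open GAF text file.

    Hypothetical helper: skips blank lines and '!' comment lines, and
    raises ValueError when a line does not have exactly 17 columns.
    """
    for lineno, line in enumerate(handle, start=1):
        if line.startswith("!") or not line.strip():
            continue
        values = line.rstrip("\n").split("\t")
        if len(values) != len(GAF_FIELDS):
            raise ValueError(
                f"line {lineno}: expected {len(GAF_FIELDS)} columns, "
                f"found {len(values)}"
            )
        yield dict(zip(GAF_FIELDS, values))


# Usage with an in-memory file; a real file opened with open() works the same way.
example = io.StringIO(
    "!gaf-version: 2.0\n"
    "DB\tID1\tsym\t\tTO:0000001\tPMID:1\tIEA\t\tT\t\t\tgene\ttaxon:4530\t20140115\tsrc\t\t\n"
)
records = list(read_gaf(example))
```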

elserj commented 2 years ago

Thank you again for bringing these up. I'm still digging in to them and checking all of our files. We mostly just use our files with our own fork of AmiGO, and it doesn't display the date so we didn't notice the issue with the TO dates. I will work on getting those fixed.

And you are right about those 2 Sorghum files having 16 columns and the Oryza file having 18. It looks like I made some manual changes to those 3 files a while back, which caused the issue. I will get those fixed soon as well. I am also checking our other files to make sure they don't have the same issues.

I will post again once I get those all fixed and check all the other files.

I don't really know of any other tools for dealing with gaf files. I see you submitted an issue to goatools asking about it, and that is probably the best option. This issue may also be relevant.

elserj commented 2 years ago

Ok, I think I fixed all the issues with columns I found. The ones in go_gene_Oryza_Gramene.assoc were only on commented lines, so they shouldn't have mattered anyway, but they are fixed in any case.

I did find a couple of other files that had column issues, but I believe they are all fixed now.

I also fixed the dates on the to_germplasm files.

I'm going to go ahead and close this, but feel free to reply or reopen if you find more issues.

serenalotreck commented 2 years ago

Thanks so much for your quick work on fixing those!

serenalotreck commented 2 years ago

> I don't really know of any other tools for dealing with gaf files. I see you submitted an issue to goatools asking about it, and that is probably the best option. This issue may also be relevant.

Just wanted to update you that I submitted a PR that was accepted! goatools now supports reading GAF files with any Aspect code, so it can be used on these files without problems!

serenalotreck commented 2 years ago

I used wget to download the entirety of the database, and found a few more GAF files with issues:

po_anatomy_germplasm_Arabidopsis_NASC.assoc
to_gene_G.max_PPPP.assoc

These two throw UnicodeDecodeErrors for having invalid bytes in them.
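One way to pinpoint where such bytes sit is to decode the file line by line and record where decoding fails. The following is a rough sketch under the assumption that the files are meant to be UTF-8; `find_invalid_utf8` is a hypothetical helper name, not a goatools function.

```python
def find_invalid_utf8(data: bytes):
    """Return (line_number, byte_offset_in_line) for each undecodable line.

    Hypothetical helper: reports only the first bad byte on each line.
    """
    bad = []
    for lineno, raw in enumerate(data.splitlines(), start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first offending byte
            bad.append((lineno, err.start))
    return bad
```

For a file on disk, the bytes could be obtained with `open(path, "rb").read()` and passed to the function.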

Thanks!

elserj commented 2 years ago

Thanks again for pointing these out. It looks like those 2 files are in ISO-8859 encoding rather than UTF-8 or ASCII like the rest of them.

I've fixed the to_gene_G.max_PPPP.assoc file as it obviously just had a random extra character in it.

However, the po_anatomy_germplasm_Arabidopsis_NASC.assoc file has the illegal characters in a way that it isn't obvious it is wrong. I'm going to have to do some digging to figure out what is correct for that file. I will update here once I figure it out.
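A mechanical re-encoding along these lines is possible, with the caveat raised above: it changes only how the bytes are stored and cannot tell whether the resulting characters are the intended ones. The sketch assumes the specific encoding is ISO-8859-1 (the thread says only "ISO-8859"), and `to_utf8` is a hypothetical helper name.

```python
def to_utf8(data: bytes, fallback: str = "iso-8859-1") -> bytes:
    """Return data as UTF-8 bytes.

    Bytes that already decode as UTF-8 are returned unchanged; otherwise
    they are decoded with the fallback encoding (an assumption here) and
    re-encoded as UTF-8.
    """
    try:
        data.decode("utf-8")
        return data
    except UnicodeDecodeError:
        return data.decode(fallback).encode("utf-8")
```

Because every byte value is valid in ISO-8859-1, the fallback decode never fails, so a stray byte is converted silently rather than flagged. The converted characters still need a manual check.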

serenalotreck commented 2 years ago

Thank you! To my knowledge, these should be the only two left, as I've now processed all the files I downloaded, which should be everything from the SVN repo.

elserj commented 2 years ago

Should be fixed now. That file was 15 years old, and no one had noticed that the characters didn't display correctly on our site either, so thanks again for noticing and getting me to fix it.

serenalotreck commented 2 years ago

Thank you, glad to help!