AdmiralenOla / Scoary

Pan-genome wide association studies
GNU General Public License v3.0

Don't enforce "Non-unique gene name" and "Annotation" columns #57

Closed AdmiralenOla closed 7 years ago

AdmiralenOla commented 7 years ago

Stop enforcing the columns "Non-unique gene name" and "Annotation" in the output. Some users might have an input file with only a single identifier column (Gene ID) before the sample info starts, and want to run with -s 2.

In the current version, this will cause Scoary to fill in the "Non-unique gene name" and "Annotation" columns with sample data, because it automatically assumes that this info can be found in columns 2 and 3. There is really no need to enforce any columns other than Gene ID.
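To illustrate the point above, here is a minimal sketch of how a row could be split when only the first column is treated as mandatory. The function name `split_row` and its `start_col` parameter are hypothetical and are not taken from the Scoary source; `start_col` merely mirrors the 1-based meaning of `-s` as described in this issue.

```python
def split_row(row, start_col):
    """Split one input row into (gene_id, extra, samples).

    start_col is the 1-based column where sample data begins
    (an assumption mirroring the -s option discussed in this issue).
    """
    gene_id = row[0]
    # Any columns between the gene ID and the first sample column.
    # This is empty when start_col == 2, so no sample data leaks into it.
    extra = row[1:start_col - 1]
    samples = row[start_col - 1:]
    return gene_id, extra, samples


# Input with a single identifier column, as when running with -s 2
print(split_row(["geneA", "1", "0", "1"], start_col=2))

# Roary-style input where sample data starts at column 15
roary_row = ["geneA", "nameA", "annotA"] + [""] * 11 + ["1", "0", "1"]
print(split_row(roary_row, start_col=15))
```

With this kind of split, "Non-unique gene name" and "Annotation" are only read when they actually exist before the first sample column.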

dutchscientist commented 7 years ago

Actually, I was just about to suggest an alternative, allowing the user to specify column numbers to be included in the output (so I can see the gene numbers of specific strains in the dataset in the Scoary output).

For now, I am modifying the "Non-unique gene name" column for this and then splitting that one out.

AdmiralenOla commented 7 years ago

Hi! Trying to wrap my head around this, but I don't quite see how it would work. I think I'm confused by "gene numbers of specific strains in the dataset". Do you mean grabbing columns from the input Roary file or producing some kind of aggregate column? Would you mind giving an example?

dutchscientist commented 7 years ago

The way I envisage it is similar to the existing switch that makes Scoary start counting from column 15 in the Roary output.

Say that these are the headers from a Roary output:

Gene, Non-unique Gene name, Annotation, No. isolates, No. sequences, Avg sequences per isolate, Genome Fragment, Order within Fragment, Accessory Fragment, Accessory Order with Fragment, QC, Min group size nuc, Max group size nuc, Avg group size nuc, Sample1, Sample2, Sample3

The Scoary output will contain the first 3 columns followed by the counts, etc:

Gene, Non-unique gene name, Annotation

I would like to have a switch that lets me also include the information in the rows for Sample1, Sample2 and/or Sample3, something like "--columns_included 16,17,18". The group output of Roary is not always informative; the gene number can be.

Will see whether I can upload an example.
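The column-selection idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `parse_columns` and `pick_columns` are hypothetical helpers, not Scoary functions, and the column numbers are taken to be 1-based. Counted that way, Sample1, Sample2 and Sample3 are columns 15, 16 and 17 of the example header.

```python
# Header row from the Roary example in this thread
HEADER = [
    "Gene", "Non-unique Gene name", "Annotation", "No. isolates",
    "No. sequences", "Avg sequences per isolate", "Genome Fragment",
    "Order within Fragment", "Accessory Fragment",
    "Accessory Order with Fragment", "QC", "Min group size nuc",
    "Max group size nuc", "Avg group size nuc",
    "Sample1", "Sample2", "Sample3",
]


def parse_columns(spec):
    """Turn a comma-separated string such as '15,16,17' into a list of ints."""
    return [int(tok) for tok in spec.split(",") if tok.strip()]


def pick_columns(row, spec):
    """Return the cells of `row` at the 1-based positions listed in `spec`."""
    return [row[i - 1] for i in parse_columns(spec)]


# Selecting the three sample columns from the header row
print(pick_columns(HEADER, "15,16,17"))
```

Applying `pick_columns` to each data row, rather than the header, would give the per-strain gene identifiers to carry over into the output.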

AdmiralenOla commented 7 years ago

OK, I think I understand what you mean now. Sure, I can implement that, should be fairly easy! I will schedule it for the next release.

dutchscientist commented 7 years ago

Cool! I aim to get you a lot of citations and help you increase your h-index ;-)

AdmiralenOla commented 7 years ago

Hi @dutchscientist. This functionality is included in the latest version. Hope you like it!

dutchscientist commented 7 years ago

Hi Ola, great! Will try it soon (currently travelling for a few weeks) :)


dutchscientist commented 7 years ago

Yes, this is great! Exactly what I wanted, the --include_input_columns is just what I needed. Thanks very much!