PATRIC3 / patric3_website

Legacy PATRIC Website (JBoss Portal Version)
MIT License
5 stars 2 forks source link

I lied to PATRIC & it liked it #1962

Closed aswarren closed 6 years ago

aswarren commented 6 years ago

Minhash supports fastq files. I told PATRIC that my fastq file was a contigs file so I could submit it to similar genome finder. It returned the correct result.

We should add type "reads" to allowable inputs in the similar genome finder.

mshukla1 commented 6 years ago

...or just call you a lier next time! ;)

olsonanl commented 6 years ago

This is a problem across the board. We have had a number of problems where uploaded data does not match the declared type.

I have working fastq and fasta validators that we can bolt into place. One issue, however, is that "reads" does not mean fastq; it can mean bam as well:

SPAdes takes as input paired-end reads, mate-pairs and single (unpaired) reads in FASTA and FASTQ. For IonTorrent data SPAdes also supports unpaired reads in unmapped BAM format (like the one produced by Torrent Server). However, in order to run read error correction, reads should be in FASTQ or BAM format. Sanger, Oxford Nanopore and PacBio CLR reads can be provided in both formats since SPAdes does not run error correction for these types of data.

There is a distinction between the file format and type of file as we are using it.

mshukla1 commented 6 years ago

Not sure if there is any short term action item here or we bite the bullet fix the file type / file format checking and make it consistently available across all services.

olsonanl commented 6 years ago

There is not a good hack fix to this.

aswarren commented 6 years ago

Regardless of what we do to clean up the filetype space, we will still have filetypes; apps will still be authorized to work with certain filetypes; this one should be authorized to work with fastq since it supports it. If you want to clean up filetypes, that is a separate issue/ticket.

aswarren commented 6 years ago

Hmm I think I need to walk that back. I forgot BAM, that is my bad. I will try to think about this.

aswarren commented 6 years ago

After looking at this further it looks like BAM is its own type right now. https://github.com/PATRIC3/Workspace/blob/master/typeslist.txt

image

So that isn't exactly an issue for this service. More broadly, for my money, exploding types like "reads" and "contigs" into their constituent formats and having those as our input types makes sense to me. I think if we did that then we could support the old "types" (reads, contigs) as we transition to no longer creating them.

I will try to create a proposed solution for this in a couple of slides so we can talk about it next week.

aswarren commented 6 years ago

Since BAM is not currently a "reads" type. This is enabled for Similar genome finder here. https://github.com/PATRIC3/p3_web/pull/773

aswarren commented 6 years ago

It looks like fastq only works when it is small and compressed fastq files don't seem to work even though MASH supports it. The invocation of MASH may need to be reworked for fastq: https://github.com/marbl/Mash/issues/32

olsonanl commented 6 years ago

How does it fail?

aswarren commented 6 years ago

@olsonanl Scratch that. It succeeded. We just were not setting the MASH distance threshold to be permissive enough. This seems to be common enough (along with the need to search all public genomes) that I'm wondering if we should be hiding the "Advanced" parameters here by default. They don't seem very "Advanced" and most people would want to know that they are searching only "Reference & Representative" without having to dig.