Next steps - Githubissues

edeutsch commented 2 years ago

Let's make it so that it can take the FASTA file as a command line argument. Use argparse.

    python fasta_stats.py ../proteomes/maize/orginal/mitochondrion.2.fasta    # shows the summary like you have now
    python fasta_stats.py --show_duplicate_identifiers ../proteomes/maize/orginal/mitochondrion.2.fasta
    python fasta_stats.py --show_duplicate_sequences ../proteomes/maize/orginal/mitochondrion.2.fasta
    python fasta_stats.py --show_duplicate_descriptions ../proteomes/maize/orginal/mitochondrion.2.fasta

    compare_fasta.py input_file_1 input_file_2

Determine:

how many distinct sequences in each file
how many overlapping sequences in each file
how many sequences unique to each file
how many overlapping identifiers in each file
of the overlapping identifiers, do their sequences match?

Make these reusable classes so we can build on this
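One possible shape for such a reusable class is sketched below. The class name `FastaFile`, its methods, and the choice to treat the first whitespace-delimited token of a header as the identifier are all assumptions for illustration, not something settled in this thread.

```python
from collections import Counter


class FastaFile:
    """Hypothetical reusable container for the entries of one FASTA file."""

    def __init__(self, entries):
        # entries: list of (identifier, description, sequence) tuples
        self.entries = list(entries)

    @classmethod
    def from_lines(cls, lines):
        """Parse FASTA text given as an iterable of lines."""
        entries = []
        identifier, description, chunks = None, "", []
        for line in lines:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if identifier is not None:
                    entries.append((identifier, description, "".join(chunks)))
                # Assumption: identifier = first whitespace-delimited token
                header = line[1:].split(None, 1)
                identifier = header[0] if header else ""
                description = header[1] if len(header) > 1 else ""
                chunks = []
            elif identifier is not None:
                chunks.append(line)
        if identifier is not None:
            entries.append((identifier, description, "".join(chunks)))
        return cls(entries)

    @classmethod
    def from_path(cls, path):
        """Parse the FASTA file at the given path."""
        with open(path) as handle:
            return cls.from_lines(handle)

    def duplicates(self, field):
        """Return {value: count} for values of a field seen more than once."""
        index = {"identifier": 0, "description": 1, "sequence": 2}[field]
        counts = Counter(entry[index] for entry in self.entries)
        return {value: n for value, n in counts.items() if n > 1}

    def summary(self):
        """Return basic per-file counts."""
        return {
            "entries": len(self.entries),
            "distinct_identifiers": len({e[0] for e in self.entries}),
            "distinct_sequences": len({e[2] for e in self.entries}),
        }
```

With something like this, `fasta_stats.py` could call `FastaFile.from_path(...)` and then `summary()` or `duplicates("sequence")`, and a later comparison script could reuse the same parsing.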

Build a matrix of overlaps between all the maize files:

           file1  file2  file3
    file1   1000    342     24
    file2           600    599
    file3                  898

primarily by sequences

but also do a matrix by identifier

    python compare_fasta.py --by_sequence input_file_1 input_file_2
    python compare_fasta.py --by_identifier input_file_1 input_file_2
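A minimal sketch of the overlap matrix, assuming each file has already been reduced to a set of sequences (or a set of identifiers for the by-identifier variant). The function names `overlap_matrix` and `format_matrix` are made up for this example.

```python
def overlap_matrix(files):
    """Count shared items for every pair of files.

    files: dict mapping a file label to a set of sequences (or identifiers).
    Returns the labels and a dict keyed by (row_label, col_label) holding
    only the upper triangle, since the overlap is symmetric.
    """
    labels = list(files)
    matrix = {}
    for i, row in enumerate(labels):
        for col in labels[i:]:
            # The diagonal is the number of distinct items in that file
            matrix[(row, col)] = len(files[row] & files[col])
    return labels, matrix


def format_matrix(labels, matrix):
    """Render the upper-triangular matrix as right-aligned text.

    Assumes at least one label is present.
    """
    widths = [len(label) for label in labels]
    widths += [len(str(value)) for value in matrix.values()]
    width = max(widths) + 2
    lines = ["".join(cell.rjust(width) for cell in [""] + labels)]
    for i, row in enumerate(labels):
        cells = [row]
        for j, col in enumerate(labels):
            cells.append(str(matrix[(row, col)]) if j >= i else "")
        lines.append("".join(cell.rjust(width) for cell in cells))
    return "\n".join(lines)
```

Passing sets of identifiers instead of sets of sequences gives the second matrix without changing the code.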

MLi0411 commented 2 years ago

Hi, just to clarify:

"how many sequences unique to each file" is the only part where we are comparing between two files, right? For the rest, "distinct" and "overlapping" refer to distinct and redundant sequences/identifiers within one file? Thanks!

edeutsch commented 2 years ago

You're right, I did mix the concepts a little, but I would say that only the first ("how many distinct sequences in each file") is a property that can be computed from just one file on its own. The rest are all comparing two files. I worded it a bit poorly. Maybe this is better:

how many entries, repeated identifiers, distinct identifiers, repeated sequences, distinct sequences in each file (not a comparison)
how many overlapping sequences between the two files (i.e. how many sequences are in both files)
how many sequences unique to each file (e.g. file 1 has nn sequences not in file 2; file 2 has MM sequences not in file 1)
how many overlapping identifiers between the two files (i.e. how many identifiers are in both files)
how many identifiers unique to each file (e.g. file 1 has nn identifiers not in file 2; file 2 has MM identifiers not in file 1)
of the overlapping identifiers between the two files, for how many do their sequences match and for how many is it different?

Does that make it clearer?

Maybe you can think of other stats that would help us understand how two files are related.
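The two-file comparisons listed above map fairly directly onto set operations. The sketch below assumes each file has already been parsed into a list of `(identifier, sequence)` tuples; the function name `compare_fasta` and the result keys are hypothetical, and when an identifier repeats within a file the last entry is the one compared.

```python
def compare_fasta(entries_1, entries_2):
    """Compare two files given as lists of (identifier, sequence) tuples."""
    seqs_1 = {seq for _, seq in entries_1}
    seqs_2 = {seq for _, seq in entries_2}

    # Assumption: if an identifier repeats within a file, the last entry wins
    by_id_1 = dict(entries_1)
    by_id_2 = dict(entries_2)
    shared_ids = by_id_1.keys() & by_id_2.keys()
    matching = sum(1 for ident in shared_ids if by_id_1[ident] == by_id_2[ident])

    return {
        "shared_sequences": len(seqs_1 & seqs_2),
        "sequences_only_in_1": len(seqs_1 - seqs_2),
        "sequences_only_in_2": len(seqs_2 - seqs_1),
        "shared_identifiers": len(shared_ids),
        "identifiers_only_in_1": len(by_id_1.keys() - by_id_2.keys()),
        "identifiers_only_in_2": len(by_id_2.keys() - by_id_1.keys()),
        "shared_identifiers_same_sequence": matching,
        "shared_identifiers_different_sequence": len(shared_ids) - matching,
    }
```

For example, `compare_fasta([("a", "AAA"), ("b", "CCC")], [("a", "AAA"), ("b", "TTT")])` reports two shared identifiers, one with a matching sequence and one with a different sequence.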

MLi0411 commented 2 years ago

Yup, that makes sense. Thanks!

cashewballerz commented 2 years ago

I'm having an issue recognizing the file locally on my computer. Most of my code should work otherwise. I'll bring it up later today in the meeting. Also, with argparse, do you want this to be applicable to all the fasta files we currently have? Or just to mitochondria.2 to start?

edeutsch commented 2 years ago

okay, great! Yes, let's meet today at 5pm to discuss. You can use argparse to pick up runtime options as well as input files. So I was thinking something like this:

    import argparse
    import os

    def main():
        #### Define the runtime options and the positional list of input files
        argparser = argparse.ArgumentParser(description='description of program')
        argparser.add_argument('--n_threads', action='store', type=int, help='Set the number of files to process in parallel (defaults to number of cores)')
        argparser.add_argument('--refresh', action='count', default=0, help='If set, existing metadata for a file will be overwritten rather than skipping the file')
        argparser.add_argument('--verbose', action='count', default=0, help='If set, print more information about ongoing processing')
        argparser.add_argument('files', type=str, nargs='+', help='Filenames of one or more mzML files to read')
        params = argparser.parse_args()

        #### Loop over all the files to ensure that they are really there before starting work
        for file in params.files:
            if not os.path.isfile(file):
                print(f"ERROR: File '{file}' not found or not a file")
                return

    if __name__ == "__main__":
        main()

(this is lifted from a different program so the exact options don't make sense for what you're doing, but it shows how to specify options and filenames using argparse)
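Adapted to the `fasta_stats.py` options requested earlier in this thread, the same pattern might look roughly like the sketch below. The three `--show_duplicate_*` flags come from the example commands above; the positional name `fasta_file`, the helper `build_parser`, and the return codes are assumptions for illustration.

```python
import argparse
import os


def build_parser():
    """Build a parser with the flags proposed in this thread."""
    parser = argparse.ArgumentParser(description="Summarize the entries of a FASTA file")
    parser.add_argument("--show_duplicate_identifiers", action="store_true",
                        help="If set, list identifiers that occur more than once")
    parser.add_argument("--show_duplicate_sequences", action="store_true",
                        help="If set, list sequences that occur more than once")
    parser.add_argument("--show_duplicate_descriptions", action="store_true",
                        help="If set, list descriptions that occur more than once")
    # Assumption: a single input file; nargs='+' would accept several
    parser.add_argument("fasta_file", type=str, help="Path of the FASTA file to summarize")
    return parser


def main(argv=None):
    """Parse arguments and check the input file; returns a process exit code."""
    params = build_parser().parse_args(argv)
    if not os.path.isfile(params.fasta_file):
        print(f"ERROR: File '{params.fasta_file}' not found or not a file")
        return 1
    # The summary and any requested duplicate reports would be produced here
    return 0
```

In a script this would typically be run with `sys.exit(main())` under an `if __name__ == "__main__":` guard. Because `store_true` flags default to `False`, running with only the filename would show just the summary.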

PlantProteomes / SeqComparison

Next steps #1