eileenwho / geobio_bioinformatics

code for building phylogenetic trees and dealing with files that are outputs and inputs to various genomic analysis workflows

clean up repository #2

Open eileenwho opened 6 years ago

eileenwho commented 6 years ago

go through and clarify what each script does

make code more general

make sure comments are clear

Combine my previous file-handling code into one multipurpose tool

Options:

Type of file extension

Edit or search

Search: search for what

Name of new file

Edit: replace what with what

What do deflines look like? Or even copy in a block and have that be analyzed?

And default
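A rough sketch of how the options above could map onto one command-line tool. Every name here (`mode`, `--ext`, `--pattern`, `--replace`, `--out`, and the helper functions) is hypothetical and not taken from the existing scripts; matching is plain substring matching to keep it simple.

```python
import argparse
from pathlib import Path


def build_parser():
    """Options mirroring the list above; all option names are hypothetical."""
    parser = argparse.ArgumentParser(description="Search or edit sequence files")
    parser.add_argument("mode", choices=["search", "edit"])
    parser.add_argument("--ext", default=".fasta", help="file extension to process")
    parser.add_argument("--pattern", required=True, help="text to search for")
    parser.add_argument("--replace", default="", help="replacement text (edit mode)")
    parser.add_argument("--out", default="output.txt", help="name of the new file")
    return parser


def search_lines(lines, pattern):
    """Return only the lines that contain the pattern."""
    return [line for line in lines if pattern in line]


def edit_lines(lines, pattern, replacement):
    """Return every line with the pattern replaced."""
    return [line.replace(pattern, replacement) for line in lines]


def run(args, folder="."):
    """Apply the chosen mode to every matching file and write one new file."""
    out_path = Path(folder) / args.out
    results = []
    for path in sorted(Path(folder).glob("*" + args.ext)):
        if path == out_path:
            continue  # never read the file we are about to write
        lines = path.read_text().splitlines()
        if args.mode == "search":
            results.extend(search_lines(lines, args.pattern))
        else:
            results.extend(edit_lines(lines, args.pattern, args.replace))
    out_path.write_text("\n".join(results) + "\n")
    return results
```

For example, `run(build_parser().parse_args(["search", "--pattern", ">"]))` would collect every defline from the `.fasta` files in the current folder into `output.txt`.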

eileenwho commented 6 years ago

also add in more defensive programming

if possible, make sure that blast won't copy in sequences that are already in the file, etc.

see notes on blast30ribo for additional things

eileenwho commented 6 years ago

1 speed up blast

2 make sure writing to output file doesn't take too much time

3 add option to check for redundant sequence names and delete them

thoughts on extractCopyProteinSeq

priorities are: do blast and copy over in 1 step so there are just fewer files around

with an output file that saves every line written to a file, name of that file, name of original file

ok currently output file has everything

could maybe get rid of a few lines in blast output thru https://www.ncbi.nlm.nih.gov/books/NBK279682/
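One possible shape for the redundant-name check in item 3, as a sketch only. It assumes FASTA-style input where deflines start with `>`, and it treats the first whitespace-separated word of the defline as the sequence name; both function names are hypothetical.

```python
def read_fasta(lines):
    """Yield (defline, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def drop_redundant(records):
    """Keep the first record for each sequence name; report the rest."""
    seen = set()
    kept, dropped = [], []
    for name, seq in records:
        words = name.split()
        key = words[0] if words else name  # assumption: first word is the ID
        if key in seen:
            dropped.append(name)
        else:
            seen.add(key)
            kept.append((name, seq))
    return kept, dropped
```

Returning the dropped deflines as well as the kept records means they can be written to the output file, so there is a record of what was deleted.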

try to speed up blast (multiple threads?)

http://seqanswers.com/forums/showthread.php?t=26085

https://wiki.hpcc.msu.edu/display/Bioinfo/BLAST+with+Multiple+Processors

http://voorloopnul.com/blog/how-to-correctly-speed-up-blast-using-num_threads/
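A minimal sketch of passing a thread count to BLAST+ from Python. `-query`, `-db`, `-out`, `-outfmt` and `-num_threads` are standard BLAST+ options; the function names and the file names in the usage note are placeholders. Whether more threads actually helps depends on the workload, so it is worth timing.

```python
import subprocess


def build_blast_command(program, query, db, out, threads=4):
    """Return a BLAST+ command line as a list of arguments."""
    if program not in ("blastn", "blastp", "blastx"):
        raise ValueError(f"unsupported program: {program}")
    return [
        program,
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6",  # tabular output: one line per hit
        "-num_threads", str(threads),
    ]


def run_blast(command):
    """Run BLAST; raises CalledProcessError if it exits with an error."""
    subprocess.run(command, check=True)
```

Usage would look like `run_blast(build_blast_command("blastp", "query.faa", "my_db", "tempblast.txt", threads=8))`. Swapping the first argument is also the only change needed to switch between blastn, blastx and blastp here.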

add in notes about what to change for blastn, blastx, blastp, or a different file format, or for only doing blast / only doing copying over

also there are many notes on blast30ribo_extractCopyProteinSeq_withnotes.py

There's no reason to do blast and copy over in different steps

Send these separate files to people in case they want to do it separately

But can just copy directly into analysis-l1 file

With output file so you have records of what the results were and can check in case of errors

And terminal output because it takes a while

Still can use temp blast

But if there's something, copy in, if not don't

Also add option for doing blastp and blastn

Look more at syntax of tempblast file to be certain of things

add some output to run in terminal?

for now write "to only do one thing, comment out the relevant function" but later make it so that if you don't have -of or if you don't have -db you can just do only extract or only copy, want to make it easier for the user
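A sketch of choosing the steps from which flags are present, so nothing has to be commented out. It assumes `-db` names the BLAST database (needed for the blast/extract step) and `-of` names the file that sequences get copied into (needed for the copy step); that mapping is a guess, and the function names are hypothetical.

```python
import argparse


def build_parser():
    """Both flags are optional; which ones are given decides what runs."""
    parser = argparse.ArgumentParser(description="BLAST and/or copy sequences")
    parser.add_argument("-db", dest="database", default=None,
                        help="BLAST database (assumed: enables the blast step)")
    parser.add_argument("-of", dest="output_file", default=None,
                        help="file to copy into (assumed: enables the copy step)")
    return parser


def plan_steps(args):
    """Return the list of steps to run, in order."""
    steps = []
    if args.database:
        steps.append("blast")
    if args.output_file:
        steps.append("copy")
    if not steps:
        raise SystemExit("nothing to do: give -db, -of, or both")
    return steps
```

The main script could then loop over `plan_steps(build_parser().parse_args())` and call the matching function for each step.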

notes written into the code earlier

better comments

add README/ internal README DONE

come up with a better name extractCopyProteinSeq DONE

clean up camel case vs _'s

could i get rid of tempblast?

WOULD LIKE TODO

speed it up possibly if the time doesn't come from the blasting process

add option to only do blast or only do copying over?

leave in ribo protein list so that's an extra level of flexibility

add output file for if sequences are a lot longer

to do checking, make another file that's copied in from the first blast results

save 3 top typical lines

if copied in

or do both at the same time

if copied in line not within # for length

calc closeness of genome with different cutoffs of front or end cutting off by a certain

hmm 1 problem is a subst or deletion would throw things off

so maybe check against first 10 or last 10 of references

and if neither work just cut off from the end

anything for if it's too short?

make sure to be clear what inputs this wants and what it outputs aka where things are

i guess first thing is to check if it's significantly different

can print in terminal "note:" __ "has a sequence that is ___ longer than the first sequence in the file"

would have to save that elsewhere; don't think you can read an append file
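A sketch of the length check, comparing every sequence to the first one in the file and collecting "note:" messages. The 20% cutoff for "significantly different" is an arbitrary placeholder, and the function name is hypothetical. It also flags sequences that are too short. Keeping the records in a list in memory avoids having to read back a file that is open for appending.

```python
def length_notes(records, tolerance=0.2):
    """records: list of (name, sequence) pairs, reference sequence first.

    Returns one note per sequence whose length differs from the first
    sequence by more than `tolerance` (a fraction of the reference length).
    """
    if not records:
        return []
    ref_len = len(records[0][1])
    notes = []
    for name, seq in records[1:]:
        diff = len(seq) - ref_len
        if abs(diff) > tolerance * ref_len:
            direction = "longer" if diff > 0 else "shorter"
            notes.append(
                f"note: {name} has a sequence that is {abs(diff)} "
                f"{direction} than the first sequence in the file"
            )
    return notes
```

The notes can be printed to the terminal as they are found and also written to the output file.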

or a thing where you check to see if this genome is already in the file b/c that would handle some errors

add option to trim deflines maybe

eileenwho commented 6 years ago

oh just realized that integrating the blast and the copying in of sequences opens this up to more problems if it stops running in the middle, b/c some things will be half blasted and copied

hmmm, making sure that you don't copy in something that's already there would be good but would also probably be slower

oh wait, you could check the output file and see where it stopped
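A sketch of the "check the output file and see where it stopped" idea. It assumes the script writes a line of the form `DONE<TAB>genome_name` to the output file only after a genome has been fully blasted and copied. That marker format is made up here and is not what the current output file contains.

```python
def finished_genomes(log_lines):
    """Return the set of genomes that have a DONE marker in the log."""
    done = set()
    for line in log_lines:
        fields = line.rstrip("\n").split("\t")
        # hypothetical marker line: "DONE<TAB>genome_name"
        if len(fields) == 2 and fields[0] == "DONE":
            done.add(fields[1])
    return done


def genomes_to_run(all_genomes, log_lines):
    """Return the genomes still to process, in their original order."""
    done = finished_genomes(log_lines)
    return [genome for genome in all_genomes if genome not in done]
```

A genome that was interrupted halfway has no marker, so it is run again on restart. Its partially copied sequences would still need to be caught by the check for sequences that are already in the file.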