NAL-i5K / NAL_RNA_seq_annotation_pipeline

Other
5 stars 3 forks source link

Change default name prefix #24

Closed mpoelchau closed 4 years ago

mpoelchau commented 4 years ago

It would be great if the final output file names could be changed to the format -

HsiuKangHuang commented 4 years ago

I noticed that there are two output.bam files. One is in the directory "rnannot/2020-04-09-22-52-00/" (the working directory) and the other one is in "rnannot/2020-04-09-22-52-00/SRR1793080" (the SRR folder under working directory) Do we need to change the one inside SRR1793080 too? or just the one at "rnannot/2020-04-09-22-52-00/"?

mpoelchau commented 4 years ago

I'm not sure how the files differ. Which one is the 'final' output file?

HsiuKangHuang commented 4 years ago

I think the one in "rnannot/2020-04-09-22-52-00/" (the working directory) is the final one. It is the merge file of each SRR output.bam file. [Line 325]

mpoelchau commented 4 years ago

Okay! For now, let's focus on the one in the working directory. Thanks!

HsiuKangHuang commented 4 years ago

In the tsv file, we don't have the info of [assemblyname] but I found a couple of Edirect commands can get assembly name by tax_id. https://www.biostars.org/p/346408/

I am trying to add it into tsv file; therefore, the rna_seq_annotate.py can use it as a parameter.

mpoelchau commented 4 years ago

That's worth a try. The person running the pipeline might have to add it themselves, though, since an organism can have multiple genome assemblies.

HsiuKangHuang commented 4 years ago

This is a part of output I got after I used Edirect commands. image

HsiuKangHuang commented 4 years ago

As the result shows, this tax id only has one assembly name. However, you're right. It may be a problem if there are more than one assembly name. The way that letting users add assembly name into the argument themselves may be better.

Maybe I can change the command from "RNAseq_annotate.py -i ./example/1049336.tsv -g ./Edan07162013.scaffolds.fa.gz -d" to something like this "RNAseqannotate.py -i 1049336.tsv -o [gggsss][assemblyname]RNA-Seq-alignments[date].bam -g Edan07162013.scaffolds.fa.gz -d"

HsiuKangHuang commented 4 years ago

I am thinking that will it be useful for users if I get the assembly name from NCBI and add it into tsv file? To add assembly name into the argument, if users don't know the name, they can get it or choose the one they want(if there are two or more assemblies) from tsv file.

mpoelchau commented 4 years ago

There are actually 2 assemblies in GenBank for that taxid (you can see them if you go to https://www.ncbi.nlm.nih.gov/assembly/?term=ephemera+danica and under 'status - latest' select 'clear'). I don't know how to retrieve all of them using edirect though. Hopefully there's a way.

It would be useful to have download_sra_metadata.py output the different assembly accessions and their names for a given taxid, so the person running the program could have that information readily available. I'm not sure if that should go in the tsv file - perhaps to stdout?

HsiuKangHuang commented 4 years ago

I found that this command "elink -db taxonomy -id 1049336 -target Assembly" will automatically activate filter to get the latest result. Therefore, even there are two results in taxonomy database, the output of elink can only get the latest one. The output looked like this. image

It seems like only using 'Assembly' as the target will activate the filter. If I use 'Popset' or 'Protein' as the target, the count of the result is corresponding to the amount shown in taxonomy database. I am still searching for solutions to get all of the results.

HsiuKangHuang commented 4 years ago

Users who want to change the merged file "output.bam" to "[gggsss]_[assembly name]RNA-Seq-alignments[yy_mm_dd].bam" can use this command: RNAseq_annotate.py -i ./example/1049336.tsv -g ./Edan07162013.scaffolds.fa.gz -a Edan_2.0 -d I added a new argument "-a" for users to add the organism assembly name for naming output file. The information of [gggsss] is from the scientific name in tsv file and the information of [assembly name] is from the argument added by user.

Now, "RNAseq_annotate.py -h" looks like this:

usage: RNAseq_annotate.py [-h] [-i INPUT] [-g GENOME] [-n [NAME]] [-o [OUTDIR]] [-a ASSEMBLY] [-d]Easy to use pipeline built for large-scale RNA-seq mapping with a genome assemblyoptional arguments: -h, --help show this help message and exit -i INPUT, --input INPUT A tsv file with a list of SRA runs' information. -g GENOME, --genome GENOME A fasta file to align with. -n [NAME], --name [NAME] name of the output folder, if not specified, use the time of start -o [OUTDIR], --outdir [OUTDIR] directory of output folder at, if not specified, use current folder -a ASSEMBLY, --assembly ASSEMBLY The assembly name is used for naming output file. -d, --downsample if specified, a downsampled bam file will be downsampled

mpoelchau commented 4 years ago

Looks good!