WGS-TB / SplitStrains

4 stars 3 forks source link

Generating the fastq files #13

Open ibarilar opened 2 years ago

ibarilar commented 2 years ago

Hi,

I am having problems figuring out how to modify the gsc.sh(or whichever (other script) to run on my set of fastq files resulting in separated (in case of a mixture) fastq files.

Alternatively the same question but for a set of already indexed bam files.

Cheers, Ivan

lostdevfound commented 2 years ago

Hi Ivan, In gsc.sh set the variable split=1 (line 70) This will make SplitStrains classify the reads that support variants to belong either to minor or major strains. I suggest first running it on a small test file because read classification can take some time. The output will be two text files with names of the reads that belong to major or minor strains. Then use your pipeline or a pre-made script https://github.com/WGS-TB/SplitStrains/blob/master/rmreads.py to obtain bam files corresponding to minor and major strains.

lostdevfound commented 2 years ago

Also, if you are interested in fastq files for both strains, then use the output from SplitStrains (two text files with read names) to parse your original fastq files. This can be done using standard shell script methods, awk or any other tools.

ibarilar commented 2 years ago

Hi,

thank you for the quick answer.

I have gotten to the step where two number_strain.reads files are produced. Their content looks like this: NB502145:79:HJGH3BGXF:1:22103:19324:8688: c: 96: 4318: -203.8211294071598 -2640.695850496174

If i use those files on the original BAM file with rmreads.py I do not get the desired reads out. Using any of the two .reads files i get the identical .bam file as an output.

Cheers, Ivan

lostdevfound commented 2 years ago

Do you have just one read in the file? Here is an example of a typical file containing reads that were classified as reads supporting a minor strain. I posted only about 5% of the file content. The first column is the read name, the second column is the variant and its proportion, then the position, then log probabilities of each strain.

gi|strainB_100snp|-82294:   t:    40:     123971:    -1.626750605199371 -178.16116858158048 
gi|strainB_100snp|-606964:   t:    40:     123971:    -1.626750605199371 -178.16116858158048 
gi|strainB_100snp|-94034:   t:    40:     123971:    -1.626750605199371 -178.16116858158048 
gi|strainB_100snp|-1729786:   t:    40:     123971:    -1.626750605199371 -178.16116858158048 
gi|strainB_100snp|-198188:   t:    40:     123971:    -1.626750605199371 -178.16116858158048 
gi|strainB_100snp|-637110:   t:    40:     123971:    -1.626750605199371 -178.16116858158048 
gi|strainB_100snp|-1050824:   t:    40:     123971:    -1.626750605199371 -178.16116858158048 
gi|strainB_100snp|-1727200:   g:    51:     217352:    -39.73871948485717 -53.94321603163932 
gi|strainB_100snp|-1645374:   g:    51:     217352:    -39.73871948485717 -53.94321603163932 
gi|strainB_100snp|-808900:   c:    42:     254459:    -4.728932176030926 -137.17664555153 
gi|strainB_100snp|-1371622:   c:    42:     254459:    -4.728932176030926 -137.17664555153 
gi|strainB_100snp|-177568:   c:    42:     254459:    -4.728932176030926 -137.17664555153 
gi|strainB_100snp|-322078:   c:    42:     254459:    -4.728932176030926 -137.17664555153 
gi|strainB_100snp|-668980:   c:    42:     254459:    -4.728932176030926 -137.17664555153 
gi|strainB_100snp|-1623412:   c:    42:     254459:    -4.728932176030926 -137.17664555153 
gi|strainB_100snp|-1202850:   c:    42:     254459:    -4.728932176030926 -137.17664555153 

In this case, the minor strain constituted about 40% of the mixture (can be seen from the proportions of each letter)

lostdevfound commented 2 years ago

You can try this shell script

set -e
id=$1
trimQ=$2

# Path to your bam file
mainBam=${SAMPLE_PATH}/aligned/trimmed-${id}.sorted.bam
mkdir -p $SAMPLE_PATH/output/${id}_${trimQ}/typing # I don't think this line is necessary, it was used for classifying the type of the consensus sequences

# get output reads files from SplitStrains (you need to input your directory or use gsc.sh pipeline)
reads=( `ls $SAMPLE_PATH/output/${id}_${trimQ}/*.reads` )
# test if the files with the extension .reads were found by printing their names
echo ${reads[0]}
echo ${reads[1]}

# create major.bam and minor.bam by running rmreads.py
python rmreads.py ${reads[0]} $mainBam ${SAMPLE_PATH}/output/${id}_${trimQ}/major.bam
python rmreads.py ${reads[1]} $mainBam ${SAMPLE_PATH}/output/${id}_${trimQ}/minor.bam
# index bams
samtools index ${SAMPLE_PATH}/output/${id}_${trimQ}/major.bam
samtools index ${SAMPLE_PATH}/output/${id}_${trimQ}/minor.bam

# sort bam by name (needed for bedtools)
samtools sort -n -o ${SAMPLE_PATH}/output/${id}_${trimQ}/major.namesorted.bam ${SAMPLE_PATH}/output/${id}_${trimQ}/major.bam
samtools sort -n -o ${SAMPLE_PATH}/output/${id}_${trimQ}/minor.namesorted.bam ${SAMPLE_PATH}/output/${id}_${trimQ}/minor.bam

You can further obtain the consensus sequences for each strain (one long sequence that represents the genome of a strain) by checking the aforementioned script.

ibarilar commented 2 years ago

Hi,

I have a full file with multiple reads, gave you just the first row as an example of the output format. So they are how they should be.

My problem is that when i use either the major or the minor reads file with the rmreads.py i get the same bam file as an output. I thought this might mean that something is either wrong with the input format or the rmreads.py has a mistake in its filtering steps. The output bam is almost the same size as the original bam file meaning there is no filtering happening(at least not based on the major and minor read files).

Cheers, Ivan

lostdevfound commented 2 years ago

Hi Ivan, First, there are no mistakes in the script. It was used on more than a dozen of datasets from different countries. Since, you don't provide any information on the problem, I can't reproduce it.

However, I will share with you a bam file along with the .reads text files. You can download them from here and try running rmreads.py and see if that works on your setup.

If for some reason it doesn't work for you, you can always use a more conventional approach of filtering out reads by calling samtools directly as described in this thread: https://www.biostars.org/p/172737/

Also, how sure are you that no correct reads were removed? Did you try picking a read from the .read file, run the rmreads.py and then check if the read is still in the new bam file?

ibarilar commented 2 years ago

Hi,

Thanks for the files. Will try them out.

I tried the conventional approach from that exact thread, as well as some other ones(bbtools/filterbyname.sh) already. It also gave a wrong(identical) output. That is why i thought it could be due to some formatting difference between the minor/major reads file output and general bam file headers so that i could only work with rmreads.py. Although i did format the files by removing everything but the first(header) column.

Sorry, let me rephrase that last point. Reads are removed but in both cases(minor and major) the exact same bam file is generated. I did not try running it on a single read since my concern is first on the identical output in case of two different input files.

If you want here is a link to the output reads files, bam files made from them and original bam file that was used.

Cheers, Ivan

ibarilar commented 2 years ago

Update,

I ran your examples and i get two different files. From the original 267MB bam file i get a 261MB 81.bam file and 264MB 18.bam file. I ran: python3 rmreads.py 18_strain.reads trimmed-80.sorted.bam 18.bam and: python3 rmreads.py 81_strain.reads trimmed-80.sorted.bam 81.bam

If that is correct then something is different in the format of my input bam files.

Cheers, Ivan